ElasticSearch Server

Overview of this book

ElasticSearch is an open source search server built on Apache Lucene. It was built to provide a scalable search solution with built-in support for near real-time search and multi-tenancy.

Jumping into the world of ElasticSearch by setting up your own custom cluster, this book will show you how to create a fast, scalable, and flexible search solution. By learning the ins and outs of data indexing and analysis, "ElasticSearch Server" will start you on your journey to mastering the powerful capabilities of ElasticSearch. With practical chapters covering how to search data, extend your search, and go deep into cluster administration and search analysis, this book is perfect for those new and experienced with search servers.

In "ElasticSearch Server" you will learn how to revolutionize your website or application with faster, more accurate, and flexible search functionality. Starting with chapters on setting up your own ElasticSearch cluster and searching and extending your search parameters, you will quickly be able to create a fast, scalable, and completely custom search solution.

Building on your knowledge further, you will learn about ElasticSearch's query API and become confident using powerful filtering and faceting capabilities. You will develop practical knowledge on how to make use of ElasticSearch's near real-time capabilities and support for multi-tenancy.

Your journey then concludes with chapters that help you monitor and tune your ElasticSearch cluster as well as advanced topics such as shard allocation, gateway configuration, and the discovery module.
Table of Contents (17 chapters)
ElasticSearch Server
Credits
About the Authors
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface
Index

Dynamic mappings and templates


The previous topic described how we can define a type mapping when the mapping generated automatically by ElasticSearch is not sufficient. Now let's go one step back and see how automatic mapping works. Knowing this prevents surprises during the development of your applications and lets you build more flexible software. For example, if our application grows and automatically generates new indexes (say, for storing a massive number of time-based events), it is more convenient to adjust the type-determining mechanism than to define every mapping by hand. Also, if an application has many indexes, the ability to define mapping templates is very handy.

Type determining mechanism

ElasticSearch can guess the document structure by looking at the JSON that defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words (true and false), and numbers are written as bare digits. This is a simple trick, but it usually works. For the following document:

{
  "field1": 10,
  "field2": "10"
}

field1 will be guessed as a long type, but field2 will be determined as a string. The other numeric types are guessed similarly. Of course, this can be a desired behavior, but sometimes the data source may omit the type information and everything may be presented as strings. The solution to this is enabling more aggressive text checking in the mapping definition. For example, we may do the following during index creation:

curl -XPUT http://localhost:9200/blog/?pretty -d '{
  "mappings" : {
    "article": {
      "numeric_detection" : true
    }
  }
}'

Unfortunately, the same problem applies to the Boolean type, and there is no option to force ElasticSearch to guess Boolean types from text. In such cases, when changing the source format is impossible, we can only define the field directly in the mappings definition.
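The guessing rules described above can be sketched in a few lines of Python. This is only a simplified model for illustration, not ElasticSearch's actual implementation: the function name guess_type is hypothetical, and mapping JSON floating-point numbers to double is an assumption. Python's own number parsing is also more lenient than a real server is likely to be.

```python
def guess_type(value, numeric_detection=False):
    """Guess a field type from a parsed JSON value (illustrative model only)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"  # assumption: floating-point JSON numbers become double
    if isinstance(value, str):
        if numeric_detection:
            # Try the narrower integer type first, then floating point.
            for parser, type_name in ((int, "long"), (float, "double")):
                try:
                    parser(value)
                    return type_name
                except ValueError:
                    pass
        # No Boolean detection from text: "true" stays a string.
        return "string"
    raise TypeError("unsupported JSON value: %r" % (value,))
```

With the sample document shown earlier, field1 is guessed as long in both modes, while field2 is a string by default and a long only when numeric_detection is enabled.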

Another type that causes trouble is date. ElasticSearch tries to guess the dates given as timestamps or strings that match the date format. Fortunately, a list of recognized formats can be defined as follows:

curl -XPUT http://localhost:9200/blog/?pretty -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]
    }
  }
}'

As in the previous example, the preceding command shows the mappings definition during index creation. Similarly, this works in the PUT mapping API call of ElasticSearch. The format of the date definition follows the patterns used in the joda-time library (visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html). As you can see, this allows you to adapt to almost any format that can be used in the input document. Note that dynamic_date_formats is an array. This means that we can handle several date formats simultaneously.
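Because the formats are given as an array, detection amounts to trying each format in turn. The following Python sketch models that idea under loud assumptions: it uses Python's strptime directives as stand-ins for joda-time patterns (the two syntaxes differ), and the function name and the sample formats are hypothetical.

```python
from datetime import datetime


def first_matching_format(text, formats):
    """Return the first format that parses text completely, or None."""
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue  # this format does not fit, try the next one
    return None


# Hypothetical formats, written as strptime directives rather than joda-time patterns.
FORMATS = ["%Y-%m-%d %H:%M", "%d.%m.%Y"]
```

A string that matches none of the listed formats is simply not treated as a date, so in this model it would stay a string.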

Now we know how ElasticSearch guesses what is in our document. The important point is that the server can do this for any new document. Let's use a simple case to check how it deals with changes:

curl -XPUT localhost:9200/objects/obj1/1?pretty -d '{ "field1" : 254}'

Now we have a new index called objects with a single document in it—a document with only a single field. This is obviously a number, isn't it? So let's query ElasticSearch and retrieve the automatically generated mappings:

curl -XGET localhost:9200/objects/_mapping?pretty

And the reply is as follows:

{
  "objects" : {
    "obj1" : {
      "properties" : {
        "field1" : {
          "type" : "long",
          "ignore_malformed" : false
        }
      }
    }
  }
}

No surprise here, we got what we expected (more or less). Now let's try something different—the second document with the same field name, but another value:

curl -XPUT localhost:9200/objects/obj1/2?pretty -d '{
 "field1" : "one hundred and seven"
}'

And the reply is as follows:

{
  "error" : "MapperParsingException[Failed to parse [field1]]; 
  nested: NumberFormatException[For input string: 
  \"one hundred and seven\"]; ",
  "status" : 400
}

It doesn't work. ElasticSearch assumes that field1 is a number, and successive documents must fit this assumption. To be sure, let's have one more try:

curl -XPUT localhost:9200/objects/obj1/2?pretty -d '{
 "field1" : 12.2
}'

This time we tried to index a document with a number, but a number of a different type, and it succeeded. If we query for the mappings, we will notice that the type hasn't changed. ElasticSearch silently changed our value and truncated the fractional part. This is not good, but it can happen when the input data is not so good (it usually isn't), and this is why we sometimes want to turn off automatic mapping generation. Another reason for turning it off is that we may not want new fields, unknown during application development, to be added to an existing index. To turn off automatic field adding, we can set the dynamic property to false, as follows:

{
  "objects" : {
    "obj1" : {
      "dynamic" : "false",
      "properties" : {
      ...
      }
    }
  }
}
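The behavior observed with the three documents above, together with the dynamic switch, can be modelled with a toy class. This is a hedged sketch rather than ElasticSearch code: TinyIndex and its helpers are hypothetical names, and the class only mimics what the text describes (the first value fixes the field type, later values are coerced or rejected, and unknown fields are not added when dynamic is off).

```python
def _guess(value):
    """Guess a type name from a parsed JSON value (simplified)."""
    if isinstance(value, bool):  # bool first: it is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


# How a value is coerced once the field type is fixed.
_CONVERTERS = {"boolean": bool, "long": int, "double": float, "string": str}


class TinyIndex:
    """Toy stand-in for a single mapping type; for illustration only."""

    def __init__(self, dynamic=True):
        self.dynamic = dynamic
        self.mapping = {}  # field name -> type name

    def index(self, doc):
        """Return the values as they would be indexed, updating the mapping."""
        indexed = {}
        for name, value in doc.items():
            if name not in self.mapping:
                if not self.dynamic:
                    continue  # unknown field: not added to the mapping
                self.mapping[name] = _guess(value)
            # int("text") raises ValueError; int(12.2) silently yields 12.
            indexed[name] = _CONVERTERS[self.mapping[name]](value)
        return indexed
```

In this model, indexing 254 fixes field1 as long, the text value raises an error, and 12.2 is accepted but stored as 12, which matches the sequence shown above.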

Dynamic mappings

Sometimes we want the type of a field to be determined differently depending on circumstances such as the field name or the type found in the JSON. This is where dynamic templates can help. Dynamic templates are similar to the usual mappings. Each template has its pattern defined, which is applied to the document's field names. If a field matches the pattern, the template is used. The pattern can be defined in a few ways:

  • match: The template is used if the name of the field matches the pattern.

  • unmatch: The template is used if the name of the field doesn't match the pattern.

By default, the pattern is very simple and only allows the asterisk character as a wildcard. This can be changed by setting match_pattern to regex. With this option, we can use all the magic provided by regular expressions.

There are variations such as path_match and path_unmatch that can be used to match the names in nested documents.

When writing a target field definition, the following variables can be used:

  • {name}: The name of the original field found in the input document

  • {dynamic_type}: The type determined from the original document

The last important bit of information is that ElasticSearch checks templates in order of their definitions and the first matching template is applied. This means that the most generic templates (for example, with "match": "*") should be defined at the end. Let's have a look at the following example:

{
  "mappings" : {
    "article" : {
      "dynamic_templates" : [
        {
          "template_test": {
            "match" : "*",
            "mapping" : {
              "type" : "multi_field",
              "fields" : {
                "{name}": { "type" : "{dynamic_type}"},
                "str": {"type" : "string"}
              }
            }
          }
        }
      ]
    }
  }
}

In the preceding example, we defined a mapping for the article type. In this mapping, we have only one dynamic template named template_test. This template is applied for every field in the input document because of the single asterisk pattern. Each field will be treated as a multi_field, consisting of a field named as the original field (for example, title) and the second field with the same name as the original field, suffixed with str (for example, title.str). The first of the created fields will have its type determined by ElasticSearch (with the {dynamic_type} type) and the second field will be a string (because of the string type).
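To make the matching rules concrete, here is a small Python model of how a field name could be resolved against an ordered list of dynamic templates. It is a sketch under assumptions, not the server's code: the function names are hypothetical, only the default asterisk patterns are handled, and the ids template is invented to show why order matters.

```python
import re


def simple_match(pattern, name):
    """Match a name against a pattern in which * stands for any characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def substitute(node, name, dynamic_type):
    """Replace {name} and {dynamic_type} in every key and string value."""
    if isinstance(node, dict):
        return {substitute(k, name, dynamic_type): substitute(v, name, dynamic_type)
                for k, v in node.items()}
    if isinstance(node, list):
        return [substitute(item, name, dynamic_type) for item in node]
    if isinstance(node, str):
        return node.replace("{name}", name).replace("{dynamic_type}", dynamic_type)
    return node


def resolve(templates, name, dynamic_type):
    """Return (template name, field mapping) for the first matching template."""
    for template in templates:
        (template_name, body), = template.items()
        if "match" in body and not simple_match(body["match"], name):
            continue
        if "unmatch" in body and simple_match(body["unmatch"], name):
            continue
        return template_name, substitute(body["mapping"], name, dynamic_type)
    return None, {"type": dynamic_type}  # no template: keep the guessed type


TEMPLATES = [
    # Hypothetical specific template; it must come before the generic one.
    {"ids": {"match": "*_id", "mapping": {"type": "long"}}},
    {"template_test": {"match": "*", "mapping": {
        "type": "multi_field",
        "fields": {"{name}": {"type": "{dynamic_type}"},
                   "str": {"type": "string"}}}}},
]
```

Resolving a field called title with a guessed type of string falls through to template_test and yields a multi_field with the subfields title and str, while user_id stops at the earlier ids template.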

Templates

As we have seen earlier in this chapter, the index configuration, and mappings in particular, can be complicated beasts. It would be very nice if we could define one or more mappings once and then use them in every newly created index, without the need to send them every time. ElasticSearch's creators anticipated this and included a feature called index templates. Each template defines a pattern, which is compared to the name of a newly created index. When they match, the values defined in the template are copied to the index structure definition. When multiple templates match the new index name, all of them are applied, and values from templates applied later override those from templates applied earlier. This is very convenient, because we can define a few common settings in the more general templates and override them in the more specialized ones. Additionally, there is an order parameter, which lets us force the desired template ordering. You can think of templates as dynamic mappings that are applied not to the types in documents, but to the indexes.

Let's see a real example of a template. Imagine that we want to create several indexes where we don't want to store the source of the documents so that the indexes will be smaller. We also don't need any replicas. Templates can be created by calling the ElasticSearch REST API, and an example cURL command would be similar to the following:

curl -XPUT http://localhost:9200/_template/main_template?pretty -d '
{
  "template" : "*",
  "order" : 1,
  "settings" : {
    "index.number_of_replicas" : 0
  },
  "mappings" : {
    "_default_" : {
      "_source" : {
        "enabled" : false
      }
    }
  }
}'

From now on, all created indexes will have no replicas and no source stored. Note the _default_ type name in our example. This is a special type name indicating that the current rule should be applied to every document type. The second interesting thing is the order parameter. Let's define the next template with the following command:

curl -XPUT http://localhost:9200/_template/ha_template?pretty -d '
{
  "template" : "ha_*",
  "order" : 10,
  "settings" : {
    "index.number_of_replicas" : 5
  }
}'

All new indexes will behave as before except the ones with the names beginning with ha_. In this case, both the templates are applied. First, the template with the lower order is used and then, the next template overwrites the replicas setting. So, these indexes will have five replicas and disabled source storage.
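The way the two templates combine can be modelled as follows. This Python sketch only illustrates the ordering rule described in the text: the function names are hypothetical, the merge is deliberately shallow, and the real server may merge nested values differently.

```python
import copy
import re


def simple_match(pattern, name):
    """Match a name against a pattern in which * stands for any characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def effective_config(templates, index_name):
    """Apply matching templates in ascending order; later values override."""
    result = {"settings": {}, "mappings": {}}
    matching = [t for t in templates.values()
                if simple_match(t["template"], index_name)]
    for template in sorted(matching, key=lambda t: t.get("order", 0)):
        result["settings"].update(template.get("settings", {}))
        for type_name, mapping in template.get("mappings", {}).items():
            merged = result["mappings"].setdefault(type_name, {})
            merged.update(copy.deepcopy(mapping))  # shallow merge per type
    return result


# The two templates defined in this section.
TEMPLATES = {
    "main_template": {
        "template": "*", "order": 1,
        "settings": {"index.number_of_replicas": 0},
        "mappings": {"_default_": {"_source": {"enabled": False}}},
    },
    "ha_template": {
        "template": "ha_*", "order": 10,
        "settings": {"index.number_of_replicas": 5},
    },
}
```

For an index named ha_blog, this model yields five replicas and a disabled source, whereas an index named blog only picks up the values from main_template.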

There is one more important thing about this example. If we try to index a document into an index with five replicas while we have only a single node in the cluster, the request will probably fail after some time and display a message similar to the following:

{
  "error" : "UnavailableShardsException[[ha_blog][2] [6] shardIt, 
  [1] active : Timeout waiting for [1m], request: index 
  {[ha_blog][article][1], source[\n{\n  \"priority\" : 1,\n  
  \"title\" : \"Test\"\n}]}]",
  "status" : 503
}

This is because ElasticSearch tries to create multiple copies of each of the shards of which the index is built, but this only makes sense when each of these copies can be placed on different server instances.

Storing templates in files

Templates can also be stored in files. By default, the files should be placed in the config/templates directory. For example, our ha_template should be placed in the config/templates/ha_template.json file and have the following contents:

{
  "ha_template" : {
    "template" : "ha_*",
    "order" : 10,
    "settings" : {
      "index.number_of_replicas" : 5
    }
  }
}

Note that the structure of the JSON is a little bit different and has the template name as the main object key. The second important thing is that the templates must be placed in every instance of ElasticSearch. Also, the templates defined in files are not available through REST API calls.
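The difference between the two forms is small enough to automate. The helper below is a hypothetical convenience script, not part of ElasticSearch: it wraps a template body, as it would be sent to the REST API, under its name and writes it to the templates subdirectory of a given configuration directory. The directory passed in is an assumption that depends on your installation, and the script would have to be run on every node.

```python
import json
from pathlib import Path


def write_template_file(config_dir, name, body):
    """Write body as <config_dir>/templates/<name>.json, keyed by the template name."""
    templates_dir = Path(config_dir) / "templates"
    templates_dir.mkdir(parents=True, exist_ok=True)
    path = templates_dir / (name + ".json")
    # The file form uses the template name as the main object key.
    path.write_text(json.dumps({name: body}, indent=2))
    return path


# The body of ha_template exactly as it was sent to the REST API earlier.
HA_TEMPLATE = {
    "template": "ha_*",
    "order": 10,
    "settings": {"index.number_of_replicas": 5},
}
```

Calling write_template_file("config", "ha_template", HA_TEMPLATE) produces a config/templates/ha_template.json file with the same content as the listing above.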