ElasticSearch Server

Overview of this book

ElasticSearch is an open source search server built on Apache Lucene. It was built to provide a scalable search solution with built-in support for near real-time search and multi-tenancy.

Jumping into the world of ElasticSearch by setting up your own custom cluster, this book will show you how to create a fast, scalable, and flexible search solution. By learning the ins-and-outs of data indexing and analysis, "ElasticSearch Server" will start you on your journey to mastering the powerful capabilities of ElasticSearch. With practical chapters covering how to search data, extend your search, and go deep into cluster administration and search analysis, this book is perfect for those new and experienced with search servers.

In "ElasticSearch Server" you will learn how to revolutionize your website or application with faster, more accurate, and flexible search functionality. Starting with chapters on setting up your own ElasticSearch cluster and searching and extending your search parameters, you will quickly be able to create a fast, scalable, and completely custom search solution.

Building on your knowledge further, you will learn about ElasticSearch's query API and become confident using powerful filtering and faceting capabilities. You will develop practical knowledge on how to make use of ElasticSearch's near real-time capabilities and support for multi-tenancy.

Your journey then concludes with chapters that help you monitor and tune your ElasticSearch cluster as well as advanced topics such as shard allocation, gateway configuration, and the discovery module.

Manual index creation and mappings configuration


So, we have our ElasticSearch cluster up and running and we also know how to use the ElasticSearch REST API to index our data, delete it, and retrieve it, although we still don't know the specifics. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although ElasticSearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indexes (and how to delete them) and how to create mappings that suit your needs and match your data structure.

Note

Please note that we didn't include all the information about the available types in this chapter and some features of ElasticSearch (such as nested type, parent-child handling, geographical points storing, and search) are described in the following chapters of this book.

Index

An index is a logical structure in ElasticSearch that holds your data. You can imagine it as a database table that has rows and columns. A row is a document we index and a column is a single field in the index. Your ElasticSearch cluster can have many indexes inside it running at the same time. But that's not all. Because a single index is made of shards, it can be scattered across multiple nodes in a single cluster. In addition to that, each shard can have a replica (an exact copy of the shard), which is used to increase search throughput as well as to keep a duplicate of the data in case of failures.

All the shards that an index is made up of are, in fact, Apache Lucene indexes, which are divided into types.

Types

In ElasticSearch, a single index can have multiple types of documents indexed—for example, you can store blog posts and blog users inside the same index, but with completely different structures using types.

Index manipulation

As we mentioned earlier, although ElasticSearch can do some operations for us, we would like to create the index ourselves. For the purpose of this chapter, we'll use the index named posts to index the blog posts from our blogging platform. Without any more hesitation, we will send the following command to create an index:

curl -XPOST 'http://localhost:9200/posts'

We just told the ElasticSearch instance running on our local machine that we want to create the posts index. If everything goes right, you should see the following response from ElasticSearch:

{"ok":true,"acknowledged":true}

But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Because we have no data at all, we'll go for the simplest approach—we will just delete the index. To do that, we run a command similar to the preceding one, but instead of using the POST HTTP method, we use DELETE. So the actual command is as follows:

curl -XDELETE 'http://localhost:9200/posts'

And the response is very similar to what we got earlier:

{"ok":true,"acknowledged":true}

So now that we know what an index is, how to create it, and how to delete it, let's define the index structure.
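The same two calls can also be prepared from a script. The following is a minimal Python sketch that only builds the requests and does not send them; the helper name index_request and the BASE_URL constant are ours, chosen for illustration, and the URL assumes the same local node used in the cURL commands.

```python
import urllib.request

# Assumed address of the local node, as in the cURL examples
BASE_URL = "http://localhost:9200"

def index_request(method, index):
    """Build (but do not send) the HTTP request for an index operation."""
    return urllib.request.Request(f"{BASE_URL}/{index}", method=method)

create = index_request("POST", "posts")
delete = index_request("DELETE", "posts")

print(create.get_method(), create.full_url)
print(delete.get_method(), delete.full_url)

# To actually send a request against a running node:
# urllib.request.urlopen(create).read()
```

Keeping request construction separate from sending makes it easy to inspect what would be sent before touching a real cluster.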

Schema mapping

Schema mappings, or mappings for short, are used to define the index structure. As you recall, each index can have multiple types, but we will concentrate on a single type for now. We want to index blog posts that have the following structure:

  • Unique identifier

  • Name

  • Publication date

  • Contents

So far, so good, right? We decided that we want to store our posts in the posts index, so we'll define the post type to do that. In ElasticSearch, mappings are sent as JSON objects, which we will keep in a file. So, let's create a mappings file that matches the previously mentioned needs; we will call it posts.json. Its contents are as follows:

{
  "mappings": {
    "post": {
      "properties": {                
        "id": {"type":"long", "store":"yes", 
        "precision_step":"0" },
        "name": {"type":"string", "store":"yes", 
        "index":"analyzed" },
        "published": {"type":"date", "store":"yes", 
        "precision_step":"0" },
        "contents": {"type":"string", "store":"no", 
        "index":"analyzed" }             
      }
    }
  }
}

And now to create our posts index with the preceding file, we need to run the following command:

curl -XPOST 'http://localhost:9200/posts' -d @posts.json

@posts.json allows us to tell the cURL command that we want to send the contents of the posts.json file.

Note

Please note that you can name the file that stores your mappings however you want.

And again, if everything goes well, we see the following response:

{"ok":true,"acknowledged":true}
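A small syntax slip in the mappings file is easy to make, so it can be worth checking the JSON before sending it. The following Python sketch rebuilds the same mappings as a dictionary and confirms that the serialized text parses back cleanly; it is only a local sanity check and says nothing about whether ElasticSearch will accept the attribute values.

```python
import json

# The same structure as posts.json, expressed as a Python dictionary
mappings = {
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes", "precision_step": "0"},
                "name": {"type": "string", "store": "yes", "index": "analyzed"},
                "published": {"type": "date", "store": "yes", "precision_step": "0"},
                "contents": {"type": "string", "store": "no", "index": "analyzed"},
            }
        }
    }
}

body = json.dumps(mappings, indent=2)

# Round-trip to confirm the text is well-formed JSON
parsed = json.loads(body)
print(sorted(parsed["mappings"]["post"]["properties"]))

# To write the file used by the cURL command:
# with open("posts.json", "w") as f:
#     f.write(body)
```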

We have our index structure and we can index our data, but we will take a pause now; we don't really know what the contents of the posts.json file mean. So let's discuss some details about this file.

Type definition

As you can see, the posts.json file contains a JSON object and, because of that, it starts and ends with curly brackets (if you want to learn more about JSON, please visit http://www.json.org/). All the type definitions inside the mentioned file are nested in the mappings object. Inside the mappings JSON object there can be multiple types defined. In our example, we have a single post type. But if you would also like to include, for example, the user type, the file would look as follows:

{
  "mappings": {
    "post": {
      "properties": {                
        "id": { "type":"long", "store":"yes",
        "precision_step":"0" },
        "name": { "type":"string", "store":"yes", 
        "index":"analyzed" },
        "published": { "type":"date", "store":"yes", 
        "precision_step":"0" },
        "contents": { "type":"string", "store":"no", 
        "index":"analyzed" }             
      }
    },
    "user": {
      "properties": {                
        "id": { "type":"long", "store":"yes", 
        "precision_step":"0" },
        "name": { "type":"string", "store":"yes", 
        "index":"analyzed" }             
      }
    }
  }
}

You can see that each type is a JSON object and those are separated from each other by a comma character—like typical JSON structured data.

Fields

Each type is defined by a set of properties, that is, fields that are nested inside the properties object. So let's concentrate on a single field now, for example, the contents field, whose definition is as follows:

"contents": { "type":"string", "store":"no", "index":"analyzed" }

So it starts with the name of the field, which is contents in the preceding case. After the name of the field, we have an object defining the behavior of the field. Attributes are specific to the types of fields we are using and we will discuss them in the next section. Of course, if you have multiple fields for a single type (which is what we usually have), remember to separate them with a comma character.

Core types

Each field can be assigned one of the core types provided by ElasticSearch. The core types in ElasticSearch are as follows:

  • String

  • Number

  • Date

  • Boolean

  • Binary

So now, let's discuss each of the core types available in ElasticSearch and the attributes it provides to define their behavior.

Common attributes

Before continuing with all the core type descriptions I would like to discuss some common attributes that you can use to describe all the types (except for the binary one).

  • index_name: This is the name of the field that will be stored in the index. If this is not defined, the name will be set to the name of the object that the field is defined with. You'll usually omit this property.

  • index: This can take the values analyzed and no; for string-based fields, it can also be set to not_analyzed. If set to analyzed, the field will be indexed and thus searchable. If set to no, you won't be able to search on such a field. The default value is analyzed. The not_analyzed option says that the field should be indexed but not processed by the analyzer, so it is written in the index exactly as it was sent to ElasticSearch and only an exact match will be counted during a search.

  • store: This can take the values yes and no, and it specifies if the original value of the field should be written into the index. The default value is no, which means that you can't return that field in the results (although if you use the _source field, you can return the value even if it is not stored), but if you have it indexed you still can search on it.

  • boost: The default value of this attribute is 1. Basically, it defines how important the field is inside the document; the higher the boost, the more important are the values in the field.

  • null_value: This attribute specifies a value that should be written into the index if that field is not a part of an indexed document. The default behavior will just omit that field.

  • include_in_all: This attribute specifies if the field should be included in the _all field. By default, if the _all field is used, all the fields will be included in it. The _all field will be described in more detail in Chapter 3, Extending Your Structure and Search.

String

String is the most basic text type, which allows us to store one or more characters inside it. A sample definition of such a field can be as follows:

"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }

In addition to the common attributes, the following ones can also be set for string-based fields:

  • term_vector: This can take the values no (the default one), yes, with_offsets, with_positions, or with_positions_offsets. It defines whether the Lucene term vectors should be calculated for that field or not. If you are using highlighting, you will need to calculate term vectors.

  • omit_norms: This can take the value true or false. The default value is false. When this attribute is set to true, it disables the Lucene norms calculation for that field (and thus you can't use index-time boosting).

  • omit_term_freq_and_positions: This can take the value true or false. The default value is false. Set this attribute to true, if you want to omit term frequency and position calculation during indexing. (Deprecated since ElasticSearch 0.20).

  • index_options: This sets the indexing options. The possible values are docs (only document numbers are indexed for each term), freqs (document numbers and term frequencies are indexed), and positions (document numbers, term frequencies, and term positions are indexed). The default value is positions for analyzed fields and docs for fields that are not analyzed. (Available since ElasticSearch 0.20.)

  • analyzer: This is the name of the analyzer used for indexing and searching. It defaults to the globally defined analyzer name.

  • index_analyzer: This is the name of the analyzer used for indexing.

  • search_analyzer: This is the name of the analyzer used for processing the part of the query string that is sent to that field.

  • ignore_above: This is the maximum length of a field value. Values longer than the specified number of characters will be ignored and not indexed. This attribute is useful for generic not_analyzed fields that should skip very long text.

Number

This is the core type that gathers all the numeric field types available to be used. The following types are available in ElasticSearch:

  • byte: A byte value; for example, 1

  • short: A short value; for example, 12

  • integer: An integer value; for example, 134

  • long: A long value; for example, 12345

  • float: A float value; for example, 12.23

  • double: A double value; for example, 12.23

A sample definition of a field based on one of the numeric types can be as follows:

"price" : { "type" : "float", "store" : "yes", "precision_step" : "4" }

In addition to the common attributes, the following ones can also be set for the numeric fields:

  • precision_step: This controls the number of terms generated for each value in a field. The lower the value, the higher the number of terms generated, resulting in faster range queries (but a higher index size). The default value is 4.

  • ignore_malformed: This can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
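To make the precision_step trade-off concrete, here is a rough Python sketch. It assumes the way Lucene's trie-encoded numeric fields are commonly described: one extra term is written for each precision_step-bit shift of the value, so a 64-bit long needs about 64 divided by precision_step terms. The function name terms_per_value is ours, and the result should be read as an estimate rather than an exact index statistic.

```python
import math

def terms_per_value(bits, precision_step):
    """Estimate how many index terms are written for one numeric value."""
    return math.ceil(bits / precision_step)

# 64 bits for a long value; smaller steps mean more terms per value
for step in (4, 8, 16):
    print(step, terms_per_value(64, step))
```

With the default step of 4, a single long value is estimated at 16 terms, which is why lowering the step speeds up range queries at the cost of a larger index.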

Date

This core type is designed to be used for date indexing. It follows a specific format that can be changed and is stored in UTC by default.

The default date format understood by ElasticSearch is quite universal and allows us to specify the date and optionally the time; for example, 2012-12-24T12:10:22. A sample definition of a field based on the date type can be as follows (note the uppercase MM for the month, because a lowercase mm stands for minutes):

"published" : { "type" : "date", "store" : "yes", "format" : "YYYY-MM-dd" }

A sample document that uses the preceding field can be as follows:

{ 
  "name" : "Sample document",
  "published" : "2012-12-22" 
}

In addition to the common attributes, the following ones can also be set for date-based fields:

  • format: This specifies the format of the date. The default value is dateOptionalTime. For a full list of formats, please visit http://www.elasticsearch.org/guide/reference/mapping/date-format.html.

  • precision_step: This controls the number of terms generated for each value in that field. The lower the value, the higher the number of terms generated, resulting in faster range queries (but a higher index size). The default value is 4.

  • ignore_malformed: This can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
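Before indexing, it can be handy to check on the client side that date values match the shape you expect. The following Python sketch parses the two sample values from this section. Python's own strptime directives are used here as a stand-in for the pattern given in the format attribute, so this only approximates what ElasticSearch does with the value.

```python
from datetime import datetime

# Date-only value, as in the sample document (year-month-day)
custom = datetime.strptime("2012-12-22", "%Y-%m-%d")

# Date with time, in the style of the default format example
default_style = datetime.fromisoformat("2012-12-24T12:10:22")

print(custom.date(), default_style.time())
```

A value that does not match raises ValueError in Python, which is a cheap way to catch malformed dates before they reach the index.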

Boolean

This core type is designed for indexing Boolean values, which can be true or false. A sample definition of a field based on the Boolean type can be as follows:

"allowed" : { "type" : "boolean" }

Binary

The binary field is a BASE64 representation of the binary data stored in the index. You can use it to store data that is normally written in binary form, like images. Fields based on this type are, by default, stored and not indexed. The binary type only supports the index_name property. A sample field definition based on the binary field looks like the following:

"image" : { "type" : "binary" }

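Because the binary field expects BASE64 text, the raw bytes have to be encoded before they are placed in the document. The following Python sketch shows one way to prepare such a document; the eight bytes used here are just stand-in data (the start of a PNG file), and the field name image matches the sample definition.

```python
import base64
import json

# Stand-in binary data: the first bytes of a PNG file
raw = b"\x89PNG\r\n\x1a\n"

# Encode to BASE64 text so the value can travel inside a JSON document
doc = {"image": base64.b64encode(raw).decode("ascii")}
print(json.dumps(doc))

# Decoding the field value gives back the original bytes
restored = base64.b64decode(doc["image"])
```
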
Multi fields

Sometimes you would like to have the same field values in two fields—for example, one for searching and one for faceting. There is a special type in ElasticSearch—multi_field—that allows us to map several core types into a single field and have them analyzed differently. For example, if we would like to calculate faceting and search on our name field, we could define the following multi_field:

"name": {
  "type": "multi_field",
  "fields": {
    "name": { "type" : "string", "index": "analyzed" },
    "facet": { "type" : "string", "index": "not_analyzed" }
  }	
}

The preceding definition will create two fields, one that we could just refer to as name and the second one that we would use as name.facet. Of course, you don't have to specify two separate fields during indexing, a single one named name is enough and ElasticSearch will do the rest.
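The effect of the two sub-fields can be pictured with a small Python sketch. This is only an illustration: lowercasing and splitting on whitespace is a crude stand-in for an analyzed field, and the function name index_name_field is ours.

```python
def index_name_field(value):
    """Show roughly which terms each sub-field would hold for one input value."""
    return {
        # Analyzed: broken into lowercase words that can be searched individually
        "name": value.lower().split(),
        # Not analyzed: kept as one exact term, which suits faceting
        "name.facet": [value],
    }

print(index_name_field("ElasticSearch Server"))
```

A search for server can match the name field, while a facet on name.facet counts the whole title as a single value.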

Using analyzers

As we mentioned when describing the mappings for string-based fields, we can specify the analyzer to be used. But what is an analyzer? It is the functionality that processes data or queries in the way we want them to be indexed or searched. For example, when we split text into words on whitespace and lowercase all the characters, we don't have to worry about users sending words in lowercase or uppercase. ElasticSearch allows us to use different analyzers at index time and at query time, so we can choose how we want our data to be processed at each stage of the search. To use one of the analyzers, we just need to specify its name in the correct property of the field and that's all!

Out-of-the-box analyzers

ElasticSearch allows us to use one of the many analyzers defined by default. The following analyzers are available out of the box:

  • standard: A standard analyzer that is convenient for most European languages (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-analyzer.html for the full list of parameters).

  • simple: An analyzer that splits the provided value on non-letter characters and converts letters to lowercase.

  • whitespace: An analyzer that splits the provided value on the basis of whitespace characters.

  • stop: This is similar to a simple analyzer; but in addition to the simple analyzer functionality, it filters the data on the provided stop words set (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-analyzer.html for the full list of parameters).

  • keyword: This is a very simple analyzer that just passes the provided value. You'll achieve the same by specifying that field as not_analyzed.

  • pattern: This is an analyzer that allows flexible text separation by the use of regular expressions (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-analyzer.html for the full list of parameters).

  • language: This is an analyzer that is designed to work with a specific language. The full list of languages supported by this analyzer can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html.

  • snowball: This is an analyzer similar to the standard one, but in addition, it provides a stemming algorithm (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer.html for the full list of parameters).
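To get a feel for how the simpler analyzers differ, here is a rough Python approximation of the simple, whitespace, and keyword analyzers based on the descriptions above. These functions are illustrative stand-ins written for this example, not ElasticSearch code, and the real analyzers handle many cases that this sketch ignores.

```python
import re

def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase each token
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are left untouched
    return text.split()

def keyword_analyzer(text):
    # Pass the whole value through as a single token
    return [text]

sample = "Quick-Brown fox, 2 dogs"
print(simple_analyzer(sample))
print(whitespace_analyzer(sample))
print(keyword_analyzer(sample))
```

Running the three functions on the same input shows why the choice matters: the same text produces four lowercase words, four raw chunks, or one exact term.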

Defining your own analyzers

In addition to the analyzers mentioned previously, ElasticSearch allows us to define new ones. In order to do that, we need to add an additional section to our mappings file, the settings section, which holds the required information for ElasticSearch during index creation. This is how we define our custom settings section:

"settings" : {
  "index" : {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "ourEnglishFilter"
          ]
        }
      },
      "filter": {
        "ourEnglishFilter": {
          "type": "kstem"
        }
      }
    }
  } 
}

As you can see, we specified that we want a new analyzer named en to be present. Each analyzer is built from a single tokenizer and multiple filters. A complete list of default filters and tokenizers can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Our en analyzer includes the standard tokenizer and three filters: asciifolding and lowercase, which are available by default, and ourEnglishFilter, which is a filter that we have defined.

To define a filter, we need to provide its name, its type (the type property), and a number of additional parameters required by that filter type. The full list of filter types available in ElasticSearch can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/. That list is changing constantly, so I'll skip commenting on it.
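The way a tokenizer and a chain of filters cooperate can be sketched in a few lines of Python. This is a simplified model under clear assumptions: the regular expression is a crude stand-in for the standard tokenizer, Unicode decomposition approximates asciifolding, and the kstem stemming step is left out entirely because Python's standard library has no equivalent.

```python
import re
import unicodedata

def standard_tokenizer(text):
    # Crude stand-in: treat each run of word characters as a token
    return re.findall(r"\w+", text)

def asciifolding(token):
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def lowercase(token):
    return token.lower()

def analyze(text, filters=(asciifolding, lowercase)):
    # One tokenizer first, then every filter applied in order to each token
    tokens = standard_tokenizer(text)
    for token_filter in filters:
        tokens = [token_filter(token) for token in tokens]
    return tokens

print(analyze("Caf\u00e9 D\u00e9j\u00e0 Vu"))
```

The order of the filters list matters in the same way it does in the settings section: each filter receives the output of the one before it.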

So, the mappings with the analyzer defined would be as follows:

{
  "settings" : {
    "index" : {
      "analysis": {
        "analyzer": {
          "en": {
            "tokenizer": "standard",
            "filter": [
             "asciifolding",
             "lowercase",
             "ourEnglishFilter"
            ]
          }
        },
        "filter": {
          "ourEnglishFilter": {
            "type": "kstem"
          }
        }
      }
    }         
  },
  "mappings" : {
    "post" : {
      "properties" : {                
        "id": { "type" : "long", "store" : "yes", 
        "precision_step" : "0" },
        "name": { "type" : "string", "store" : "yes", "index" : 
        "analyzed", "analyzer": "en" }           
      }
    }
  }
}

Analyzer fields

An analyzer field (_analyzer) allows us to specify a field value that will be used as the analyzer name for the document to which the field belongs. Imagine that you have some software running that detects the language the document is written in and you store that information in the language field in the document. Additionally, you would like to use that information to choose the right analyzer. To do that, just add the following to your mappings file:

"_analyzer" : {
  "path" : "language"
}

So the whole mappings file could be as follows:

{
  "mappings" : {
    "post" : {
      "_analyzer" : {
        "path" : "language"
      },
      "properties" : {                
        "id": { "type" : "long", "store" : "yes", 
        "precision_step" : "0" },
        "name": { "type" : "string", "store" : "yes", 
        "index" : "analyzed" },
        "language": { "type" : "string", "store" : "yes", 
        "index" : "not_analyzed"}           
      }
    }
  }
}

However, please be advised that there has to be an analyzer defined with the same name as the value provided in the language field.
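Since the analyzer name now comes from the data itself, a client-side check before indexing can catch documents that point at an analyzer you never defined. The following Python sketch is one hypothetical way to do that; the function name pick_analyzer and the set of analyzer names are assumptions made for this example, and raising ValueError is this sketch's choice, not a description of how ElasticSearch reacts.

```python
def pick_analyzer(document, defined_analyzers, path="language"):
    """Return the analyzer name taken from the document, or fail if it is unknown."""
    name = document[path]
    if name not in defined_analyzers:
        raise ValueError(f"no analyzer named {name!r} is defined")
    return name

# Hypothetical set of analyzer names configured for the index
defined = {"en", "standard"}

post = {"id": 1, "name": "Sample post", "language": "en"}
print(pick_analyzer(post, defined))
```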

Default analyzers

There is one more thing we should say about analyzers—the ability to specify the one that should be used by default if no analyzer is defined. This is done in the same way as configuring a custom analyzer in the settings section of the mappings file, but instead of specifying a custom name for the analyzer, the default keyword should be used. So to make our previously defined analyzer default, we can change the en analyzer to the following:

{
  "settings" : {
    "index" : {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "standard",
            "filter": [
             "asciifolding",
             "lowercase",
             "ourEnglishFilter"
            ]
          }
        },
        "filter": {
          "ourEnglishFilter": {
            "type": "kstem"
          }
        }
      }
    }
  }
}

Storing a document source

Sometimes, you may not want to store separate fields; instead, you may want to store the whole input JSON document. In fact, ElasticSearch does that by default. If you want to change that behavior and do not want to include the source of the document, you need to disable the _source field. This is as easy as adding the following part to our type definition:

"_source" : { 
  "enabled" : false 
}

So the whole mappings file would be as follows:

{
  "mappings": {
    "post": {
      "_source": {
        "enabled": false 
      },
      "properties": {                
        "id": {"type":"long", "store":"yes", 
        "precision_step":"0" },
        "name": {"type":"string", "store":"yes", 
        "index":"analyzed" },
        "published": {"type":"date", "store":"yes", 
        "precision_step":"0" },
        "contents": {"type":"string", "store":"no", 
        "index":"analyzed" }             
      }
    }
  }
}

All field

Sometimes, it's handy to have some of the fields copied into one; instead of searching multiple fields, a general purpose field will be used for searching—for example, when you don't know which fields to search on. By default, ElasticSearch will include the values from all the text fields into the _all field. On the other hand, you may want to disable such behavior. To do that we should add the following part to our type definition:

"_all" : { 
  "enabled" : false 
}

So the whole mappings file would look like the following:

{
  "mappings": {
    "post": {
      "_all": {
        "enabled": false 
      },
      "properties": {                
        "id": {"type":"long", "store":"yes", 
        "precision_step":"0" },
        "name": {"type":"string", "store":"yes", 
        "index":"analyzed" },
        "published": {"type":"date", "store":"yes", 
        "precision_step":"0" },
        "contents": {"type":"string", "store":"no", 
        "index":"analyzed" }             
      }
    }
  }
}

However, please remember that the _all field will increase the size of the index, so it should be disabled if not needed.