Elasticsearch Server - Third Edition

By: Rafal Kuc
Overview of this book

Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and big data warehouses, even those with petabytes of unstructured data. This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.

Searching with the URI request query


Before getting into the wonderful world of the Elasticsearch query language, we would like to introduce you to the simple but pretty flexible URI request search, which allows us to use a simple Elasticsearch query combined with the Lucene query language. Of course, we will extend our search knowledge using Elasticsearch in Chapter 3, Searching Your Data, but for now we will stick to the simplest approach.

Sample data

For the purpose of this section of the book, we will create a simple index with two document types. To do this, we will run the following six commands:

curl -XPOST 'localhost:9200/books/es/1' -d '{"title":"Elasticsearch Server", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/2' -d '{"title":"Elasticsearch Server Second Edition", "published": 2014}'
curl -XPOST 'localhost:9200/books/es/3' -d '{"title":"Mastering Elasticsearch", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/4' -d '{"title":"Mastering Elasticsearch Second Edition", "published": 2015}'
curl -XPOST 'localhost:9200/books/solr/1' -d '{"title":"Apache Solr 4 Cookbook", "published": 2012}'
curl -XPOST 'localhost:9200/books/solr/2' -d '{"title":"Solr Cookbook Third Edition", "published": 2015}'
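Note that newly indexed documents become searchable only after an index refresh, which Elasticsearch performs automatically (roughly once per second by default). If you search immediately after indexing and see no results, the _refresh endpoint can be called by hand. The snippet below only prints the command, since actually executing it assumes an Elasticsearch instance running on localhost:

```shell
# Build and print the refresh command for the books index; running the
# printed line against a local instance makes the six documents
# searchable right away.
REFRESH_CMD="curl -XPOST 'localhost:9200/books/_refresh'"
echo "$REFRESH_CMD"
```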

Running the preceding commands will create the books index with two types: es and solr. The title and published fields will be indexed and, thus, searchable.

URI search

All queries in Elasticsearch are sent to the _search endpoint. You can search a single index or multiple indices, and you can restrict your search to a given document type or multiple types. For example, in order to search our books index, we will run the following command:

curl -XGET 'localhost:9200/books/_search?pretty'

The results returned by Elasticsearch will include all the documents from our books index (because no query has been specified) and should look similar to the following:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "books",
      "_type" : "es",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "title" : "Elasticsearch Server Second Edition",
        "published" : 2014
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "title" : "Mastering Elasticsearch Second Edition",
        "published" : 2015
      }
    }, {
      "_index" : "books",
      "_type" : "solr",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "title" : "Solr Cookbook Third Edition",
        "published" : 2015
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Elasticsearch Server",
        "published" : 2013
      }
    }, {
      "_index" : "books",
      "_type" : "solr",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Apache Solr 4 Cookbook",
        "published" : 2012
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Mastering Elasticsearch",
        "published" : 2013
      }
    } ]
  }
}

As you can see, the response has a header that tells you the total time of the query and the shards used in the query process. In addition to this, we have documents matching the query—the top 10 documents by default. Each document is described by the index, type, identifier, score, and the source of the document, which is the original document sent to Elasticsearch.

We can also run queries against many indices. For example, if we had another index called clients, we could also run a single query against these two indices as follows:

curl -XGET 'localhost:9200/books,clients/_search?pretty'

We can also run queries against all the data in Elasticsearch by omitting the index name completely or by setting it to _all:

curl -XGET 'localhost:9200/_search?pretty'
curl -XGET 'localhost:9200/_all/_search?pretty'

In a similar manner, we can also choose the types we want to use during searching. For example, if we want to search only in the es type in the books index, we run the following command:

curl -XGET 'localhost:9200/books/es/_search?pretty' 

Please remember that, in order to search for a given type, we need to specify the index or multiple indices. Elasticsearch allows quite rich semantics when it comes to choosing index names. If you are interested, please refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-index.html; however, there is one thing we would like to point out. When running a query against multiple indices, it may happen that some of them do not exist or are closed. In such cases, the ignore_unavailable property comes in handy. When set to true, it tells Elasticsearch to ignore unavailable or closed indices.

For example, let's try running the following query:

curl -XGET 'localhost:9200/books,non_existing/_search?pretty' 

The response would be similar to the following one:

{
  "error" : {
    "root_cause" : [ {
      "type" : "index_missing_exception",
      "reason" : "no such index",
      "index" : "non_existing"
    } ],
    "type" : "index_missing_exception",
    "reason" : "no such index",
    "index" : "non_existing"
  },
  "status" : 404
}

Now let's check what will happen if we add the ignore_unavailable=true parameter to our request and execute the following command:

curl -XGET 'localhost:9200/books,non_existing/_search?pretty&ignore_unavailable=true'

In this case, Elasticsearch would return the results without any error.

Elasticsearch query response

Let's assume that we want to find all the documents in our books index that contain the elasticsearch term in the title field. We can do this by running the following query:

curl -XGET 'localhost:9200/books/_search?pretty&q=title:elasticsearch'

The response returned by Elasticsearch for the preceding request will be as follows:

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.625,
    "hits" : [ {
      "_index" : "books",
      "_type" : "es",
      "_id" : "1",
      "_score" : 0.625,
      "_source" : {
        "title" : "Elasticsearch Server",
        "published" : 2013
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "2",
      "_score" : 0.5,
      "_source" : {
        "title" : "Elasticsearch Server Second Edition",
        "published" : 2014
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Mastering Elasticsearch Second Edition",
        "published" : 2015
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "3",
      "_score" : 0.19178301,
      "_source" : {
        "title" : "Mastering Elasticsearch",
        "published" : 2013
      }
    } ]
  }
}

The first section of the response gives us information about how much time the request took (the took property is specified in milliseconds), whether it was timed out (the timed_out property), and information about the shards that were queried during the request execution—the number of queried shards (the total property of the _shards object), the number of shards that returned the results successfully (the successful property of the _shards object), and the number of failed shards (the failed property of the _shards object). The query may also time out if it is executed for a longer period than we want. (We can specify the maximum query execution time using the timeout parameter.) The failed shard means that something went wrong with that shard or it was not available during the search execution.

Of course, the mentioned information can be useful, but usually we are interested in the results that are returned in the hits object. We have the total number of documents matching the query (in the total property) and the maximum score calculated (in the max_score property). Finally, we have the hits array that contains the returned documents. In our case, each returned document contains its index name (the _index property), the type (the _type property), the identifier (the _id property), the score (the _score property), and the _source field (usually, this is the original JSON object sent for indexing).

Query analysis

You may wonder why the query we ran in the previous section worked. We indexed the Elasticsearch term and ran a query for elasticsearch, and even though they differ in capitalization, the relevant documents were found. The reason for this is analysis. During indexing, the underlying Lucene library analyzes the documents and indexes the data according to the Elasticsearch configuration. By default, Elasticsearch will tell Lucene to index and analyze both string-based data and numbers. The same happens during querying because the URI request query maps to the query_string query (which will be discussed in Chapter 3, Searching Your Data), and this query is analyzed by Elasticsearch.

Let's use the indices-analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html). It allows us to see how the analysis process is done. With this, we can see what happened to one of the documents during indexing and what happened to our query phrase during querying.

In order to see what was indexed in the title field of the Elasticsearch server phrase, we will run the following command:

curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d 'Elasticsearch Server'

The response will be as follows:

{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "server",
    "start_offset" : 14,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

You can see that Elasticsearch has divided the text into two terms—the first one has a token value of elasticsearch and the second one has a token value of server.

Now let's look at how the query text was analyzed. We can do this by running the following command:

curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d 'elasticsearch'

The response of the request will look as follows:

{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 0
  } ]
}

We can see that the word is the same as the original one that we passed to the query. We won't get into the Lucene query details and how the query parser constructed the query, but in general the indexed term after the analysis was the same as the one in the query after the analysis; so, the document matched the query and the result was returned.
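As a rough illustration of why the match is case-insensitive, the standard analyzer (among other steps) lowercases each token. The lowercasing step alone can be mimicked in the shell—keep in mind this is only an analogy, as the real analysis chain also tokenizes the text and more:

```shell
# Lowercase the query term the way the analyzer's lowercase token filter
# would; after this step the indexed term and the query term are identical,
# which is why the document matches.
TERM=$(echo 'Elasticsearch' | tr '[:upper:]' '[:lower:]')
echo "$TERM"
```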

URI query string parameters

There are a few parameters that we can use to control URI query behavior, which we will discuss now. The thing to remember is that each parameter in the query should be concatenated with the & character, as shown in the following example:

curl -XGET 'localhost:9200/books/_search?pretty&q=published:2013&df=title&explain=true&default_operator=AND'

Please remember to enclose the URL of the request in ' characters because, on Linux-based systems, the & character will otherwise be interpreted by the shell.
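As a sketch, building the request in a variable first makes the quoting issue easier to see; the command is only printed here, since executing it assumes a local Elasticsearch instance:

```shell
# Concatenate the parameters with '&'; the single quotes around the final
# URL keep the shell from treating each '&' as a background operator.
BASE='localhost:9200/books/_search'
REQ="${BASE}?pretty&q=published:2013&df=title&explain=true&default_operator=AND"
echo "curl -XGET '${REQ}'"
```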

The query

The q parameter allows us to specify the query that we want our documents to match. It allows us to specify the query using the Lucene query syntax described in the Lucene query syntax section later in this chapter. For example, a simple query would look like this: q=title:elasticsearch.

The default search field

Using the df parameter, we can specify the default search field that should be used when no field indicator is used in the q parameter. By default, the _all field will be used. (This is the field that Elasticsearch uses to copy the content of all the other fields. We will discuss this in greater depth in Chapter 2, Indexing Your Data). An example of the df parameter value can be df=title.

Analyzer

The analyzer property allows us to define the name of the analyzer that should be used to analyze our query. By default, our query will be analyzed by the same analyzer that was used to analyze the field contents during indexing.

The default operator property

The default_operator property, which can be set to OR or AND, allows us to specify the default Boolean operator used for our query (http://en.wikipedia.org/wiki/Boolean_algebra). By default, it is set to OR, which means that a single query term match will be enough for a document to be returned. Setting this parameter to AND will result in returning only the documents that match all the query terms.

Query explanation

If we set the explain parameter to true, Elasticsearch will include additional explain information with each document in the result—such as the shard from which the document was fetched and detailed information about the scoring calculation (we will talk more about it in the Understanding the explain information section in Chapter 6, Make Your Search Better). Also, remember not to fetch the explain information during normal search queries because it requires additional resources and degrades query performance. For example, a query that includes the explain information could look as follows:

curl -XGET 'localhost:9200/books/_search?pretty&explain=true&q=title:solr'

The results returned by Elasticsearch for the preceding query would be as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.70273256,
    "hits" : [ {
      "_shard" : 2,
      "_node" : "v5iRsht9SOWVzu-GY-YHlA",
      "_index" : "books",
      "_type" : "solr",
      "_id" : "2",
      "_score" : 0.70273256,
      "_source" : {
        "title" : "Solr Cookbook Third Edition",
        "published" : 2015
      },
      "_explanation" : {
        "value" : 0.70273256,
        "description" : "weight(title:solr in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.70273256,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0",
              "details" : [ ]
            } ]
          }, {
            "value" : 1.4054651,
            "description" : "idf(docFreq=1, maxDocs=3)",
            "details" : [ ]
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)",
            "details" : [ ]
          } ]
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "v5iRsht9SOWVzu-GY-YHlA",
      "_index" : "books",
      "_type" : "solr",
      "_id" : "1",
      "_score" : 0.5,
      "_source" : {
        "title" : "Apache Solr 4 Cookbook",
        "published" : 2012
      },
      "_explanation" : {
        "value" : 0.5,
        "description" : "weight(title:solr in 1) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.5,
          "description" : "fieldWeight in 1, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0",
              "details" : [ ]
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=1, maxDocs=2)",
            "details" : [ ]
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=1)",
            "details" : [ ]
          } ]
        } ]
      }
    } ]
  }
}

The fields returned

By default, for each document returned, Elasticsearch will include the index name, the type name, the document identifier, score, and the _source field. We can modify this behavior by adding the fields parameter and specifying a comma-separated list of field names. The field will be retrieved from the stored fields (if they exist; we will discuss them in Chapter 2, Indexing Your Data) or from the internal _source field. By default, the value of the fields parameter is _source. An example is: fields=title,priority.

We can also disable the fetching of the _source field by adding the _source parameter with its value set to false.

Sorting the results

Using the sort parameter, we can specify custom sorting. The default behavior of Elasticsearch is to sort the returned documents in descending order of the value of the _score field. If we want to sort our documents differently, we need to specify the sort parameter. For example, adding sort=published:desc will sort the documents in descending order of the published field. By adding the sort=published:asc parameter, we will tell Elasticsearch to sort the documents on the basis of the published field in ascending order.

If we specify custom sorting, Elasticsearch will omit the _score field calculation for the documents. This may not be the desired behavior in your case. If you still want to keep track of the scores for each document when using a custom sort, you should add the track_scores=true property to your query. Please note that tracking the scores when doing custom sorting will make the query a little slower (you may not even notice the difference) due to the processing power needed to calculate the scores.

The search timeout

By default, Elasticsearch doesn't have a timeout for queries, but you may want your queries to time out after a certain amount of time (for example, 5 seconds). Elasticsearch allows you to do this by exposing the timeout parameter. When the timeout parameter is specified, the query will be executed for up to the given timeout value and the results gathered up to that point will be returned. To specify a timeout of 5 seconds, you have to add the timeout=5s parameter to your query.

The results window

Elasticsearch allows you to specify the results window (the range of documents in the results list that should be returned). We have two parameters that allow us to specify the results window size: size and from. The size parameter defaults to 10 and defines the maximum number of results returned. The from parameter defaults to 0 and specifies from which document the results should be returned. In order to return five documents starting from the 11th one, we will add the following parameters to the query: size=5&from=10.
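Paging through results is a simple calculation on top of these two parameters. The following sketch assumes 1-based page numbers and only prints the resulting command, since running it requires a local instance:

```shell
# Translate a page number into from/size: page 3 with 5 results per page
# starts at document offset 10 (the 11th document).
PAGE=3
PAGE_SIZE=5
FROM=$(( (PAGE - 1) * PAGE_SIZE ))
echo "curl -XGET 'localhost:9200/books/_search?pretty&size=${PAGE_SIZE}&from=${FROM}'"
```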

Limiting per-shard results

Elasticsearch allows us to specify the maximum number of documents that should be collected from each shard using the terminate_after property. For example, if we want to get no more than 100 documents from each shard, we can add terminate_after=100 to our URI request.

Ignoring unavailable indices

When running queries against multiple indices, it is handy to tell Elasticsearch that we don't care about the indices that are not available. By default, Elasticsearch will throw an error if one of the indices is not available, but we can change this by simply adding the ignore_unavailable=true parameter to our URI request.

The search type

The URI query allows us to specify the search type using the search_type parameter, which defaults to query_then_fetch. Two values that we can use here are: dfs_query_then_fetch and query_then_fetch. The rest of the search types available in older Elasticsearch versions are now deprecated or removed. We'll learn more about search types in the Understanding the querying process section of Chapter 3, Searching Your Data.

Lowercasing term expansion

Some queries, such as the prefix query, use query expansion. We will discuss this in the Query rewrite section in Chapter 4, Extending Your Querying Knowledge. We are allowed to define whether the expanded terms should be lowercased or not using the lowercase_expanded_terms property. By default, the lowercase_expanded_terms property is set to true, which means that the expanded terms will be lowercased.

Wildcard and prefix analysis

By default, wildcard queries and prefix queries are not analyzed. If we want to change this behavior, we can set the analyze_wildcard property to true.

Note

If you want to see all the parameters exposed by Elasticsearch as the URI request parameters, please refer to the official documentation available at: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html.

Lucene query syntax

We thought that it would be good to know a bit more about what syntax can be used in the q parameter passed in the URI query. Some of the queries in Elasticsearch (such as the one currently being discussed) support the Lucene query parser syntax—the language that allows you to construct queries. Let's take a look at it and discuss some basic features.

A query that we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms; they come in two flavors—single terms and phrases. For example, to query for a book term in the title field, we will pass the following query:

title:book

To query for the elasticsearch book phrase in the title field, we will pass the following query:

title:"elasticsearch book"

You may have noticed that each of these consists of the field name at the beginning, followed by the term or phrase.

As we already said, the Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched in the document, meaning that the term we are searching for must be present in the field of the document. The - operator is the opposite, which means that such a part of the query can't be present in the document. A part of the query without the + or - operator will be treated as optional—it can be matched, but it doesn't have to be. So, if we want to find a document with the book term in the title field and without the cat term in the description field, we send the following query:

+title:book -description:cat

We can also group multiple terms with parentheses, as shown in the following query:

title:(crime punishment)

We can also boost parts of the query (this increases their importance for the scoring algorithm—the higher the boost, the more important the query part is) with the ^ operator and the boost value after it, as shown in the following query:

title:book^4

These are the basics of the Lucene query language and should allow you to use Elasticsearch and construct queries without any problems. However, if you are interested in the Lucene query syntax and you would like to explore that in depth, please refer to the official documentation of the query parser available at http://lucene.apache.org/core/5_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.
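One practical note when sending such queries through the URI q parameter: characters that are special in URLs (spaces, +, and double quotes among them) have to be URL-encoded. A sketch of encoding the earlier +title:book -description:cat example, assuming python3 is available on the system:

```shell
# URL-encode a Lucene query so it can safely be used as the q parameter;
# '+' becomes %2B, ':' becomes %3A and the space becomes %20.
Q='+title:book -description:cat'
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$Q")
echo "curl -XGET 'localhost:9200/books/_search?pretty&q=${ENCODED}'"
```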