Book Image

Elasticsearch Server: Second Edition

Book Image

Elasticsearch Server: Second Edition

Overview of this book

Table of Contents (18 chapters)
Elasticsearch Server Second Edition
Credits
About the Author
Acknowledgments
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Searching with the URI request query


Before going into the details of Elasticsearch querying, we will use its capabilities of using a simple URI request to search. Of course, we will extend our search knowledge using Elasticsearch in Chapter 3, Searching Your Data, but for now, we will stick to the simplest approach.

Sample data

For the purpose of this section of the book, we will create a simple index with two document types. To do this, we will run the following commands:

curl -XPOST 'localhost:9200/books/es/1' -d '{"title":"Elasticsearch Server", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/2' -d '{"title":"Mastering Elasticsearch", "published": 2013}'
curl -XPOST 'localhost:9200/books/solr/1' -d '{"title":"Apache Solr 4 Cookbook", "published": 2012}'

Running the preceding commands will create the books index with two types: es and solr. The title and published fields will be indexed. If you want to check this, you can do so by running the mappings API call using the following command (we will talk about the mappings in the Mappings configuration section of Chapter 2, Indexing Your Data):

curl -XGET 'localhost:9200/books/_mapping?pretty'

This will result in Elasticsearch returning the mappings for the whole index.

The URI request

All the queries in Elasticsearch are sent to the _search endpoint. You can search a single index or multiple indices, and you can also narrow down your search only to a given document type or multiple types. For example, in order to search our books index, we will run the following command:

curl -XGET 'localhost:9200/books/_search?pretty'

If we have another index called clients, we can also run a single query against these two indices as follows:

curl -XGET 'localhost:9200/books,clients/_search?pretty'

In the same manner, we can also choose the types we want to use during searching. For example, if we want to search only in the es type in the books index, we will run a command as follows:

curl -XGET 'localhost:9200/books/es/_search?pretty'

Note

Please remember that in order to search for a given type, we need to specify the index or indices. If we want to search for any index, we just need to set * as the index name or omit the index name totally. Elasticsearch allows quite a rich semantics when it comes to choosing index names. If you are interested, please refer to http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/multi-index.html.

We can also search all the indices by omitting the indices and types. For example, the following command will result in a search through all the data in our cluster:

curl -XGET 'localhost:9200/_search?pretty'

The Elasticsearch query response

Let's assume that we want to find all the documents in our books index that contain the elasticsearch term in the title field. We can do this by running the following query:

curl -XGET 'localhost:9200/books/_search?pretty&q=title:elasticsearch'

The response returned by Elasticsearch for the preceding request will be as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.625,
    "hits" : [ {
      "_index" : "books",
      "_type" : "es",
      "_id" : "1",
      "_score" : 0.625, "_source" : {"title":"Elasticsearch Server", "published": 2013}
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "2",
      "_score" : 0.19178301, "_source" : {"title":"Mastering Elasticsearch", "published": 2013}
    } ]
  }
}

The first section of the response gives us the information on how much time the request took (the took property is specified in milliseconds); whether it was timed out (the timed_out property); and information on the shards that were queried during the request execution—the number of queried shards (the total property of the _shards object), the number of shards that returned the results successfully (the successful property of the _shards object), and the number of failed shards (the failed property of the _shards object). The query may also time out if it is executed for a longer time than we want. (We can specify the maximum query execution time using the timeout parameter.) The failed shard means that something went wrong on that shard or it was not available during the search execution.

Of course, the mentioned information can be useful, but usually, we are interested in the results that are returned in the hits object. We have the total number of documents returned by the query (in the total property) and the maximum score calculated (in the max_score property). Finally, we have the hits array that contains the returned documents. In our case, each returned document contains its index name (the _index property), type (the _type property), identifier (the _id property), score (the _score property), and the _source field (usually, this is the JSON object sent for indexing; we will discuss this in the Extending your index structure with additional internal information section in Chapter 2, Indexing Your Data.

Query analysis

You may wonder why the query we've run in the previous section worked. We indexed the Elasticsearch term and ran a query for elasticsearch and even though they differ (capitalization), relevant documents were found. The reason for this is the analysis. During indexing, the underlying Lucene library analyzes the documents and indexes the data according to the Elasticsearch configuration. By default, Elasticsearch will tell Lucene to index and analyze both string-based data as well as numbers. The same happens during querying because the URI request query maps to the query_string query (which will be discussed in Chapter 3, Searching Your Data), and this query is analyzed by Elasticsearch.

Let's use the indices analyze API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html). It allows us to see how the analysis process is done. With it, we can see what happened to one of the documents during indexing and what happened to our query phrase during querying.

In order to see what was indexed in the title field for the Elasticsearch Server phrase, we will run the following command:

curl -XGET 'localhost:9200/books/_analyze?field=title' -d 'Elasticsearch Server'

The response will be as follows:

{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "server",
    "start_offset" : 14,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

We can see that Elasticsearch has divided the text into two terms—the first one has a token value of elasticsearch and the second one has a token value of server.

Now let's look at how the query text was analyzed. We can do that by running the following command:

curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d 'elasticsearch'

The response of the request looks as follows:

{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

We can see that the word is the same as the original one that we passed to the query. We won't get into Lucene query details and how the query parser constructed the query, but in general, the indexed term after analysis was the same as the one in the query after analysis; so, the document matched the query and the result was returned.

URI query string parameters

There are a few parameters that we can use to control the URI query behavior, which we will discuss now. Each parameter in the query should be concatenated with the & character, as shown in the following example:

curl -XGET 'localhost:9200/books/_search?pretty&q=published:2013&df=title&explain=true&default_operator=AND'

Please also remember about the ' characters because on Linux-based systems, the & character will be analyzed by the Linux shell.

The query

The q parameter allows us to specify the query that we want our documents to match. It allows us to specify the query using the Lucene query syntax described in the The Lucene query syntax section in this chapter. For example, a simple query could look like q=title:elasticsearch.

The default search field

By using the df parameter, we can specify the default search field that should be used when no field indicator is used in the q parameter. By default, the _all field will be used (the field that Elasticsearch uses to copy the content of all the other fields. We will discuss this in greater depth in the Extending your index structure with additional internal information section in Chapter 2, Indexing Your Data). An example of the df parameter value can be df=title.

Analyzer

The analyzer property allows us to define the name of the analyzer that should be used to analyze our query. By default, our query will be analyzed by the same analyzer that was used to analyze the field contents during indexing.

The default operator

The default_operator property which can be set to OR or AND allows us to specify the default Boolean operator used for our query. By default, it is set to OR, which means that a single query term match will be enough for a document to be returned. Setting this parameter to AND for a query will result in the returning of documents that match all the query terms.

Query explanation

If we set the explain parameter to true, Elasticsearch will include additional explain information with each document in the result—such as the shard, from which the document was fetched, and detailed information about the scoring calculation (we will talk more about it in the Understanding the explain information section in Chapter 5, Make Your Search Better). Also remember not to fetch the explain information during normal search queries because it requires additional resources and adds performance degradation to the queries. For example, a single result can look like the following code:

{
  "_shard" : 3,
  "_node" : "kyuzK62NQcGJyhc2gI1P2w",
  "_index" : "books",
  "_type" : "es",
  "_id" : "2",
  "_score" : 0.19178301, "_source" : {"title":"Mastering Elasticsearch", "published": 2013},
  "_explanation" : {
    "value" : 0.19178301,
    "description" : "weight(title:elasticsearch in 0) [PerFieldSimilarity], result of:",
    "details" : [ {
      "value" : 0.19178301,
      "description" : "fieldWeight in 0, product of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "tf(freq=1.0), with freq of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "termFreq=1.0"
        } ]
      }, {
        "value" : 0.30685282,
        "description" : "idf(docFreq=1, maxDocs=1)"
      }, {
        "value" : 0.625,
        "description" : "fieldNorm(doc=0)"
      } ]
    } ]
  }
}
The fields returned

By default, for each document returned, Elasticsearch will include the index name, type name, document identifier, score, and the _source field. We can modify this behavior by adding the fields parameter and specifying a comma-separated list of field names. The field will be retrieved from the stored fields (if they exist) or from the internal _source field. By default, the value of the fields parameter is _source. An example can be like this fields=title.

Note

We can also disable the fetching of the _source field by adding the _source parameter with its value set to false.

Sorting the results

By using the sort parameter, we can specify custom sorting. The default behavior of Elasticsearch is to sort the returned documents by their score in the descending order. If we would like to sort our documents differently, we need to specify the sort parameter. For example, adding sort=published:desc will sort the documents by the published field in the descending order. By adding the sort=published:asc parameter, we will tell Elasticsearch to sort the documents on the basis of the published field in the ascending order.

If we specify custom sorting, Elasticsearch will omit the _score field calculation for documents. This may not be the desired behavior in your case. If you want to still keep a track of the scores for each document when using custom sort, you should add the track_scores=true property to your query. Please note that tracking the scores when doing custom sorting will make the query a little bit slower (you may even not notice it) due to the processing power needed to calculate the score.

The search timeout

By default, Elasticsearch doesn't have timeout for queries, but you may want your queries to timeout after a certain amount of time (for example, 5 seconds). Elasticsearch allows you to do this by exposing the timeout parameter. When the timeout parameter is specified, the query will be executed up to a given timeout value, and the results that were gathered up to that point will be returned. To specify a timeout of 5 seconds, you will have to add the timeout=5s parameter to your query.

The results window

Elasticsearch allows you to specify the results window (the range of documents in the results list that should be returned). We have two parameters that allow us to specify the results window size: size and from. The size parameter defaults to 10 and defines the maximum number of results returned. The from parameter defaults to 0 and specifies from which document the results should be returned. In order to return five documents starting from the eleventh one, we will add the following parameters to the query: size=5&from=10.

The search type

The URI query allows us to specify the search type by using the search_type parameter, which defaults to query_then_fetch. There are six values that we can use: dfs_query_then_fetch, dfs_query_and_fetch, query_then_fetch, query_and_fetch, count, and scan. We'll learn more about search types in the Understanding the querying process section in Chapter 3, Searching Your Data.

Lowercasing the expanded terms

Some of the queries use query expansion, such as the prefix query. We will discuss this in the Query rewrite section of Chapter 3, Searching Your Data. We are allowed to define whether the expanded terms should be lowercased or not by using the lowercase_expanded_terms property. By default, the lowercase_expanded_terms property is set to true, which means that the expanded terms will be lowercased.

Analyzing the wildcard and prefixes

By default, the wildcard queries and the prefix queries are not analyzed. If we want to change this behavior, we can set the analyze_wildcard property to true.

The Lucene query syntax

We thought that it will be good to know a bit more about what syntax can be used in the q parameter passed in the URI query. Some of the queries in Elasticsearch (such as the one currently discussed) support the Lucene query parsers syntax—the language that allows you to construct queries. Let's take a look at it and discuss some basic features. To read about the full Lucene query syntax, please go to the following web page: http://lucene.apache.org/core/4_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.

A query that we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms—you can distinguish them into two types—single terms and phrases. For example, to query for a term book in the title field, we will pass the following query:

title:book

To query for a phrase elasticsearch book in the title field, we will pass the following query:

title:"elasticsearch book"

You may have noticed the name of the field in the beginning and in the term or phrase later.

As we already said, the Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched in the document. The - operator is the opposite, which means that such a part of the query can't be present in the document. A part of the query without the + or - operator will be treated as the given part of the query that can be matched but it is not mandatory. So, if we would like to find a document with the term book in the title field and without the term cat in the description field, we will pass the following query:

+title:book -description:cat

We can also group multiple terms with parenthesis, as shown in the following query:

title:(crime punishment)

We can also boost parts of the query with the ^ operator and the boost value after it, as shown in the following query:

title:book^4