Elasticsearch 8.x Cookbook - Fifth Edition

By: Alberto Paro

Overview of this book

Elasticsearch is a Lucene-based distributed search engine at the heart of the Elastic Stack that allows you to index and search unstructured content with petabytes of data. With this updated fifth edition, you'll cover comprehensive recipes relating to what's new in Elasticsearch 8.x and see how to create and run complex queries and analytics. The recipes will guide you through performing index mapping, aggregation, working with queries, and scripting using Elasticsearch. You'll focus on numerous solutions and quick techniques for performing both common and uncommon tasks such as deploying Elasticsearch nodes, using the ingest module, working with X-Pack, and creating different visualizations. As you advance, you'll learn how to manage various clusters, restore data, and install Kibana to monitor a cluster and extend it using a variety of plugins. Furthermore, you'll understand how to integrate your Java, Scala, Python, and big data applications such as Apache Spark and Pig with Elasticsearch and create efficient data applications powered by enhanced functionalities and custom plugins. By the end of this Elasticsearch cookbook, you'll have gained in-depth knowledge of implementing the Elasticsearch architecture and be able to manage, search, and store data efficiently and effectively using Elasticsearch.

Mapping base types

Letting Elasticsearch infer the mapping, as in the Using explicit mapping creation recipe, makes it possible to start ingesting data quickly with a schemaless approach, without being concerned about field types. However, to achieve better results and indexing performance, you need to define a mapping manually.

Fine-tuning mapping brings some advantages, such as the following:

  • Reducing the index size on the disk (disabling functionalities for custom fields)
  • Indexing only interesting fields (general speed up)
  • Precooking data for fast search or real-time analytics (such as aggregations)
  • Correctly defining whether a field must be analyzed in multiple tokens or considered as a single token
  • Defining mapping types such as geo point, suggester, vectors, and so on

Elasticsearch allows you to use base fields with a wide range of configurations.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To execute this recipe's examples, you will need to create an index with a test name, where you can put mappings, as explained in the Using explicit mapping creation recipe.

How to do it...

Let's use a semi real-world example of a shop order for our eBay-like shop:

  1. First, we must define an order:
Figure 2.1 – Example of an order

  2. Our order record must be converted into an Elasticsearch mapping definition, as follows:
    PUT test/_mapping
    {  "properties" : {
          "id" : {"type" : "keyword"},
          "date" : {"type" : "date"},
          "customer_id" : {"type" : "keyword"},
          "sent" : {"type" : "boolean"},
          "name" : {"type" : "keyword"},
          "quantity" : {"type" : "integer"},
          "price" : {"type" : "double"},
          "vat" : {"type" : "double", "index": false}
    } }

Now, the mapping is ready to be put in the index. We will learn how to do this in the Putting a mapping in an index recipe of Chapter 3, Basic Operations.
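As a minimal sketch, the same mapping can be written as a Python dictionary and serialized to the JSON body expected by PUT test/_mapping. The sample order and the unmapped_fields helper are hypothetical and only illustrate how a document lines up with the mapping; the order values are invented, since the original figure is not reproduced here.

```python
import json

# Mapping body from the recipe, expressed as a Python dictionary.
order_mapping = {
    "properties": {
        "id": {"type": "keyword"},
        "date": {"type": "date"},
        "customer_id": {"type": "keyword"},
        "sent": {"type": "boolean"},
        "name": {"type": "keyword"},
        "quantity": {"type": "integer"},
        "price": {"type": "double"},
        "vat": {"type": "double", "index": False},
    }
}

# Hypothetical order document; every value here is invented for illustration.
sample_order = {
    "id": "1234",
    "date": "2022-01-15T10:30:00Z",
    "customer_id": "customer1",
    "sent": True,
    "name": "item1",
    "quantity": 3,
    "price": 4.3,
    "vat": 20.0,
}


def unmapped_fields(document, mapping):
    """Return the document fields that have no entry in the mapping."""
    return sorted(set(document) - set(mapping["properties"]))


# JSON body that could be sent with PUT test/_mapping from any HTTP client.
mapping_body = json.dumps(order_mapping, indent=2)
```

An empty result from unmapped_fields(sample_order, order_mapping) indicates that every field of the document is covered by the manually defined mapping.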

How it works...

Field types must be mapped to one of the Elasticsearch base types, and options on how the field must be indexed need to be added.

The following table is a reference for the mapping types:

Figure 2.2 – Base type mapping

Depending on the data type, it's possible to give explicit directives to Elasticsearch when you're processing the field for better management. The most used options are as follows:

  • store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space but reduces computation if you need to extract it from a document (that is, in scripting and aggregations). The possible values for this option are true and false. Stored fields are always returned as an array of values for consistency.

The stored fields are faster than others in aggregations.

  • index (default true): This defines whether or not the field should be indexed. The possible values for this parameter are true and false. Fields that are not indexed are generally not searchable.
  • null_value: This defines a default value if the field is null.
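  • boost: This is used to change the importance of a field (the default is 1.0). Note that the mapping-level boost parameter was removed in Elasticsearch 8.0, so on 8.x clusters boosting must be applied at query time.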

boost works on a term level only, so it's mainly used in term, terms, and match queries.

  • search_analyzer: This defines an analyzer to be used during the search. If it's not defined, the analyzer of the parent object is used (the default is null).
  • analyzer: This sets the default analyzer to be used (the default is null).
  • norms: This controls the Lucene norms, which are used to score queries better. If the field is only used for filtering, it's a best practice to disable norms to reduce resource usage (the default is true for analyzed text fields and false for keyword fields).
  • copy_to: This allows you to copy the content of a field to another one to achieve functionality similar to the _all field.
  • ignore_above: This skips indexing any string longer than the given value. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term's byte-length limit of 32,766. The maximum suggested value is 8191 (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html).
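The options above can be combined in a single mapping. The following is a minimal sketch, not a recommended configuration: the field names (status, description, all_text) are hypothetical, and the analyzers are simply two of the built-in ones. It also shows where the suggested maximum of 8191 for ignore_above comes from: the option counts characters, Lucene counts bytes, and a UTF-8 character can occupy up to 4 bytes.

```python
import json

LUCENE_MAX_TERM_BYTES = 32766   # Lucene term byte-length limit cited above
MAX_UTF8_BYTES_PER_CHAR = 4     # a UTF-8 character occupies at most 4 bytes

# ignore_above counts characters while Lucene counts bytes, so the largest
# value that is safe for arbitrary UTF-8 text is the limit divided by 4.
safe_ignore_above = LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR

# Hypothetical fields that exercise the options described above.
options_mapping = {
    "properties": {
        "status": {
            "type": "keyword",
            "null_value": "unknown",            # used when the field is null
            "ignore_above": safe_ignore_above,  # longer strings are not indexed
            "copy_to": "all_text",              # also feed the catch-all field
        },
        "description": {
            "type": "text",
            "store": True,                 # keep a separately retrievable copy
            "norms": False,                # disabled: field is not used for scoring
            "analyzer": "standard",        # analyzer used at index time
            "search_analyzer": "simple",   # analyzer used at search time
            "copy_to": "all_text",
        },
        "vat": {"type": "double", "index": False},  # not indexed
        "all_text": {"type": "text"},               # target of copy_to
    }
}

# JSON body that could be sent with PUT test/_mapping.
options_body = json.dumps(options_mapping, indent=2)
```

With these assumptions, safe_ignore_above evaluates to 8191, which matches the maximum suggested in the ignore_above description.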

There's more...

From Elasticsearch version 6.x onward, as shown in the Using explicit mapping creation recipe, the automatically inferred type for a string is a multifield mapping:

  • The default processing is text. This mapping allows textual queries (such as term, match, and span queries). In the example provided in the Using explicit mapping creation recipe, this was name.
  • The keyword subfield is used for keyword mapping. This field can be used for exact term matching and aggregation and sorting. In the example provided in the Using explicit mapping creation recipe, the referred field was name.keyword.
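Written out explicitly, the multifield mapping described above looks roughly like the following sketch. The ignore_above value of 256 is the one Elasticsearch uses for dynamically mapped strings; the field_paths helper is hypothetical and only lists the paths that can be referenced in queries.

```python
# Multifield mapping for a string: analyzed text plus a keyword subfield.
name_mapping = {
    "properties": {
        "name": {
            "type": "text",  # full-text search on name
            "fields": {
                # exact matching, aggregations, and sorting on name.keyword
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
        }
    }
}


def field_paths(mapping):
    """List each top-level field followed by its subfield paths."""
    paths = []
    for field, config in mapping["properties"].items():
        paths.append(field)
        for subfield in config.get("fields", {}):
            paths.append(f"{field}.{subfield}")
    return paths
```

Calling field_paths(name_mapping) yields name for textual queries and name.keyword for exact term matching, aggregations, and sorting.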

Another important parameter, available only for text mapping, is term_vector (the vector of terms that compose a string). Please refer to the Lucene documentation for further details at https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/index/Terms.html.

term_vector can accept the following values:

  • no: This is the default value; that is, the term vector is not stored.
  • yes: This stores the term vector.
  • with_offsets: This stores the term vector with token offsets (the start and end positions in a block of characters).
  • with_positions: This stores the position of each token in the term vector.
  • with_positions_offsets: This stores all the term vector data.
  • with_positions_payloads: This is used to store the position and payloads of the token in the term vector.
  • with_positions_offsets_payloads: This stores all the term vector data with payloads.

Term vectors allow fast highlighting but consume disk space because additional text information is stored. It's a best practice to only activate them in fields that require highlighting, such as title or document content.
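As a minimal sketch of this practice, the following enables term vectors only on a field that needs highlighting. The text_field helper and the field names (title, comment) are hypothetical; the accepted values are the ones listed above.

```python
# Values accepted by term_vector, as listed above.
TERM_VECTOR_VALUES = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
    "with_positions_payloads",
    "with_positions_offsets_payloads",
}


def text_field(term_vector="no"):
    """Build a text field mapping, rejecting unknown term_vector values."""
    if term_vector not in TERM_VECTOR_VALUES:
        raise ValueError(f"unsupported term_vector value: {term_vector!r}")
    return {"type": "text", "term_vector": term_vector}


# Only the highlighted field pays the extra disk cost of term vectors.
highlight_mapping = {
    "properties": {
        "title": text_field("with_positions_offsets"),
        "comment": text_field(),  # default: no term vector stored
    }
}
```

Validating the value before building the mapping is a design choice of this sketch; it catches typos before the request reaches the cluster.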

See also

You can refer to the following sources for further details on the concepts of this chapter:

  • The online documentation on Elasticsearch provides a full description of all the properties for the different mapping fields at https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-params.html.
  • The Specifying a different analyzer recipe at the end of this chapter shows alternative analyzers to the standard one.
  • For newcomers who want to explore the concepts of tokenization, I would suggest reading the official Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.