Book Image

Mastering Elasticsearch 5.x - Third Edition

Book Image

Mastering Elasticsearch 5.x - Third Edition

Overview of this book

Elasticsearch is a modern, fast, distributed, scalable, fault tolerant, and open source search and analytics engine. Elasticsearch leverages the capabilities of Apache Lucene, and provides a new level of control over how you can index and search even huge sets of data. This book will give you a brief recap of the basics and also introduce you to the new features of Elasticsearch 5. We will guide you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. We’ll also explore advanced concepts, including aggregation, index control, sharding, replication, and clustering. We’ll show you the modules of monitoring and administration available in Elasticsearch, and will also cover backup and recovery. You will get an understanding of how you can scale your Elasticsearch cluster to contextualize it and improve its performance. We’ll also show you how you can create your own analysis plugin in Elasticsearch. By the end of the book, you will have all the knowledge necessary to master Elasticsearch and put it to efficient use.
Table of Contents (20 chapters)
Mastering Elasticsearch 5.x - Third Edition
Credits
About the Author
Acknowledgements
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Introducing Elasticsearch 5.x


In 2015, Elasticsearch, after acquiring Kibana, Logstash, Beats, and Found, re-branded the company name as Elastic. According to Shay Banon, the name change is part of an initiative to better align the company with the broad solutions it provides: future products, and new innovations created by Elastic's massive community of developers and enterprises that utilize the ELK stack for everything from real-time search, to sophisticated analytics, to building modern data applications.

But having several products under one hood resulted in discord among them during the release process and started creating confusion for the users. This resulted in the ELK stack being renamed to Elastic Stack and the company decided to keep releasing all components of the Elastic Stack together. This is so that they will all share the same version number for all the products to keep speed with your deployments, simplify compatibility testing, and make it even easier for developers to add new functionality across the stack.

The very first GA release under Elastic stack is 5.0.0, which will be covered throughout this book. Further, Elasticsearch keeps pace with Lucene version releases to incorporate bug fixes and the latest features into Elasticsearch. Elasticsearch 5.0 is based on Lucene 6, which is a major release from Lucene with some awesome new features and a focus on improving the search speed. We will discuss Lucene 6 in upcoming chapters to let you know how Elasticsearch is going to have some awesome improvements, both from search and storage points of view.

Introducing new features in Elasticsearch

Elasticsearch 5.x has many improvements and has gone through a great refactoring, which caused removal/deprecation of some features. We will keep discussing the removed/improved/new features in upcoming chapters, but for now let's take an overview of the new and improved things in Elasticsearch.

New features in Elasticsearch 5.x

Following are some of the most important features introduced in Elasticsearch version 5.0:

  • Ingest node: This node is a new type of node in Elasticsearch, which can be used for simple data transformation and enrichment before actual data indexing takes place. The best thing is that any node can be configured to act as an ingest node and it is very lighter across the board. You can avoid Logstash for these tasks because the ingest node is a Java based implementation of the Logstash filter and comes as a default in Elasticsearch itself.

  • Index shrinking: By design, once an index is created, there is no provision of reducing the number of shards for that index and this brings a lot of challenges since each shard consumes some resources. Although this design still remains same, to make life easier for users, Elasticsearch has introduced a new _shrink API to overcome this problem. This API allows you to shrink an existing index into a newer index with a fewer number of shards.

    Note

    We will cover the ingest node and shrink API in detail under Chapter 9, Data Transformation and Federated Search.

  • Painless scripting language: In Elasticsearch, scripting has always been a matter of concern because of its slowness and for security reasons. Elasticsearch 5.0 includes a new scripting language called Painless, which has been designed to be fast and secure. Painless is still going through lots of improvements to make it more awesome and easily adaptable. We will cover it under Chapter 3, Beyond Full Text Search.

  • Instant aggregations: Queries have been completely refactored in 5.0; they are now parsed on the coordinating node and serialized to different nodes in a binary format. This allows Elasticsearch to be much more efficient, with more cache-able queries, especially on data separated into time-based indices. This will cause a significant speed up for aggregations.

  • A new completion suggester: The completion suggester has undergone a complete rewrite. This means that the syntax and data structure for fields of type completion have changed, as have the syntax and response of the completion suggester requests. The completion suggester is now built on top of the first iteration of Lucene's new suggest API.

  • Multi-dimensional points: This is one of the most exciting features of Lucene 6, which empowers Elasticsearch 5.0. It is built using the k-d tree geospatial data structure to offer a fast single- and multi-dimensional numeric range and a geospatial point-in-shape filtering. A multi-dimensional point helps in reducing disk storage, memory utilization, and faster searches.

  • Delete by Query API: After much demand from the community, Elasticsearch has finally provided the ability to delete documents based on a matching query using the _delete_by_query REST endpoint.

New features in Elasticsearch 2.x

Apart from the features discussed just now, you can also benefit from all of the new features that came in Elasticsearch version 2.x. For those who have not had a look at the 2.x series, let's have a quick revamp of the new features which came with Elasticsearch under this series:

  • Reindex API: In Elasticsearch, re-indexing of documents is almost needed by every user, under several scenarios. The _reindex API makes this task very easy and you do not need to worry about writing your own code to do the same. This API, at the simplest level, provides the ability to move data from one index to another but also provides a great control while re-indexing the documents, such as using scripts for data transformation and many other parameters. You can take a look at the reindex API at following URL https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-reindex.html.

  • Update by query: Similar to re-indexing requirements, a user also demands to easily update the documents in place, based on certain conditions, without re-indexing the data. Elasticsearch provided this feature using the update_by_query REST endpoint in version 2.x.

  • Tasks API: The task management API, which is exposed by the _task REST endpoint, is used for retrieving information about the currently executing tasks on one or more nodes in the cluster. The following examples show the usage of the tasks API:

GET /_tasks 
GET /_tasks?nodes=nodeId1,nodeId2 
GET /_tasks?nodes=nodeId1&actions=cluster;* 
  • Since each task has an ID, you can either wait for the completion of the task or cancel the task in the following way:

POST /_tasks/taskId1/_cancel
  • Query profiler: The Profile API is an awesome tool to debug the queries and get the insights to know why a certain query is slow and take steps to improve it. This API was released in the 2.2.0 version and provides detailed timing information about the execution of individual components in a search request. You just need to send profile as true with your query object to get this working for you. For example:

  curl -XGET 'localhost:9200/_search' -d '{ 
    "profile": true,  
    "query" : { 
      "match" : { "message" : "query profiling test" } 
    } 
  }' 

The changes in Elasticsearch

The change list is very long and covering all the change details is out of the scope of this book, since most of the changes are internal level changes which a user should not be worried about. However, we will cover the most important changes an existing Elasticsearch user must know.

Although this book is based on Elasticsearch version 5.0, it is very important for the reader to get to know the changes being made between versions 1.x to 2.x. If you are new to Elasticsearch and are not aware about older versions, you can skip this section.

Changes between 1.x to 2.x

Elasticsearch version 2.x was focused on resiliency, reliability, simplification, and features. This release was based on Apache Lucene 5.x and specifically improves query execution and spatial search.

Version 2.x also delivers considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance or an upgrade. The bigger the cluster, the bigger the headache. Node failures or a reboot can trigger a shard reallocation storm, and entire shards are sometimes copied over the network, despite having whole data. Users have also reported more than a day of recovery time to restart a single node.

With 2.x, recovery of existing replica shards became almost instant, and there is more lenient reallocation, which avoids reshuffling and makes rolling upgrades much easier and faster. Auto-regulating feedback loops in recent updates also eliminates past worries about merge throttling and related settings.

Elasticsearch 2.x also solved many of the known issues that plagued previous versions, including:

  • Mapping conflicts (often yielding wrong results)

  • Memory pressures and frequent garbage collections

  • Low reliability of data

  • Security breaches and split brains

  • Slow recovery during node maintenance or rolling cluster upgrades

Mapping changes

Elasticsearch developers earlier assumed an index as a database and a type as a table. This allowed users to create multiple types inside the same index, but eventually became a major source of issues because of restrictions imposed by Lucene.

Fields that have the same name inside multiple types in a single index are mapped to a single field inside Lucene. Incorrect query outcomes and index corruption can result from a field in one document type being of an integer type while a field in another document type being of a string type. Several other issues can lead to mapping refactoring and major restrictions on handling mapping conflicts.

The following are the most significant changes imposed by Elasticsearch version 2.x:

  • Field names must be referenced by full name.

  • Field names cannot be referenced using a type name prefix.

  • Field names can't contain dots.

  • Type names can't start with a dot (.percolator is an exception)

  • Type names may not be longer than 255 characters.

  • Types may no longer be deleted. So, if an index contains multiple types, you cannot delete any of the types from the index. The only solution is to create a new index and reindex the data.

  • index_analyzer and _analyzer parameters were removed from mapping definitions.

  • Doc values became default.

  • A parent type can't pre-exist and must be included when creating child type.

  • The ignore_conflicts option of the put mappings API got removed and conflicts cannot be ignored anymore.

  • Documents and mappings can't contain metadata fields that start with an underscore. So, if you have an existing document that contains a field with _id or _type, it will not work in version 2.x. You need to reindex your documents after dropping those fields.

  • The default date format has changed from date_optional_time to strict_date_optional_time, which expects a four-digit year, and a two-digit month and day, (and optionally, a two-digit hour, minute, and second). So a dynamic index set as "2016-01-01" will be stored inside Elasticsearch in "strict_date_optional_time||epoch_millis" format. Please note that if you have been using Elasticsearch older than 1.x then your date range queries might get impacted because of this. For example, if in Elasticsearch 1.x, you have two documents indexed with one having the date as 2017-02-28T12:00:00.000Z and the second having the date as 2017-03-01T11:59:59.000Z, and if you are searching for documents between February 28, 2017 and March 1, 2017, the following query could return both the documents:

{ 
    "range": { 
      "created_at": { 
        "gte": "2017-02-28", 
        "lte": "2017-03-01" 
      } 
    } 
  } 

But in version 2.0 onwards, the same query must use the complete date time to get the same results. For example.

{ 
    "range": { 
      "created_at": { 
        "gte": "2017-02-28T00:00:00.000Z", 
        "lte": "2017-03-01T11:59:59.000Z" 
      } 
    } 
  } 

In addition, you can also use the date match operation in combination with date rounding to get the same results as following query:

{ 
    "range": { 
      "doc.created_at": { 
        "lte": "2017-02-28||+1d/d", 
        "gte": "2017-02-28", 
        "format": "strict_date_optional_time" 
      } 
    } 
  } 

Query and filter changes

Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.

Queries were used to find out how relevant a document was to a particular query by calculating a score for each document. Filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch, with the help of bloom filters, would cache those documents in memory to retrieve them quickly in case the same filter was executed again.

However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters became the same internal object, taking care of both document relevance and matching.

So, an Elasticsearch query that used to look like the following:

{ 
"filtered" : { 
"query": { query definition }, 
"filter": { filter definition } 
 } 
} 

It should now be written like this in version 2.x:

{ 
"bool" : { 
"must": { query definition }, 
"filter": { filter definition } 
} 
} 

Additionally, the confusion caused by choosing between a bool filter and an and / or filter has been addressed with the elimination of and / or filters, and replaced by the bool query syntax in the preceding example. Rather than the unnecessary caching and memory requirements that often resulted from a wrong filter, Elasticsearch now tracks and optimizes frequently used filters and doesn't cache for segments with less than 10,000 documents or 3% of the index.

Security, reliability, and networking changes

Starting from 2.x, Elasticsearch now runs under the Java Security Manager enabled by default, which streamlines permissions after startup.

Elasticsearch has applied a durable-by-default approach to reliability and data duplication across multiple nodes. Documents are now synced to disk before indexing requests are acknowledged, and all file renames are now atomic to prevent partially written files.

On the networking side, based on extensive feedback from system administrators, Elasticsearch removed multicasting, and the default zen discovery has been changed to unicast. Elasticsearch also now binds to the localhost by default, preventing unconfigured nodes from joining public networks.

Monitoring parameter changes

Before version 2.0.0, Elasticsearch used the SIGAR library for operating system-dependent statistics. But SIGAR is no longer maintained, and it has been replaced in Elasticsearch by a reliance on stats provided by JVM. Accordingly, we see various changes in the monitoring parameters of the node info and node stats APIs:

  • network.* has been removed from nodes info and nodes stats.

  • fs.*.dev and fs.*.disk* have been removed from nodes stats.

  • os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.

  • os.mem.total and os.swap.total have been removed from nodes info.

  • From the _stats API, id_cache parameter, which tells about parent-child data structure memory, usage has also been removed. The id_cache can now be fetched from fielddata.

Changes between 2.x to 5.x

Elasticsearch 2.x did not see too many releases in comparison to the 1.x series. The last release under 2.x was 2.3.4 and since then Elasticsearch 5.0 was released. The following are the most important changes an existing Elasticsearch user must know before adapting to the latest releases.

Note

Elasticsearch 5.x requires Java 8 so make sure to upgrade your Java versions before getting started with Elasticsearch.

Mapping changes

From a user's perspective, changes under mappings are the most important changes to know because a wrong mapping will disallow index creation or can lead to unwanted search. Here are the most important changes under this category that you need to know.

No more string fields

The string type is removed in favor of the text and keyword data type. In earlier versions of Elasticsearch, the default mapping for string based fields looked like the following:

      { 
         "content" : { 
            "type" : "string" 
         } 
      } 

Starting from version 5.0, the same will be created using the following syntax:

        { 
          "content" : { 
            "type" : "text", 
            "fields" : { 
              "keyword" : { 
                "type" : "keyword", 
                "ignore_above" : 256 
              } 
            } 
          } 
        } 

This allows you to perform a full-text search on the original field name and to sort and run aggregations on the sub-keyword field.

Note

Multi-fields are enabled by default for string-based fields and can cause extra overhead if a user is relying on dynamic mapping generation.

However, if you want to create specific mapping for string fields for full-text searches, it will be created as shown in the following example:

      { 
            "content" : { 
              "type" : "string" 
            } 
      } 

Similarly, a not_analyzed string field needs to be created using the following mapping:

      { 
            "content" : { 
              "type" : "keyword" 
            } 
      } 

Note

On all field data types (except for the deprecated string field), the index property now only accepts true/false instead of not_analyzed/no.

Floats are default

Earlier, the default data type for decimal fields used to be double but now it has been changed to float.

Changes in numeric fields

Numeric fields are now indexed with a completely different data structure, called the BKD tree. This is expected to require less disk space and be faster for range queries. You can read the details at the following link:

https://www.elastic.co/blog/lucene-points-6.0

Changes in geo_point fields

Similar to numeric fields, the geo_point field now also uses the new BKD tree structure and field parameters for geo_point fields are no longer supported: geohash, geohash_prefix, geohash_precision, and lat_lon. Geohashes are still supported from an API perspective, and can still be accessed using the .geohash field extension, but they are no longer used to index geo point data.

For example, in previous versions of Elasticsearch, the mapping of a geo_point field could look like the following:

"location":{ 
      "type": "geo_point", 
      "lat_lon": true, 
      "geohash": true, 
      "geohash_prefix": true, 
      "geohash_precision": "1m" 
} 

But, starting from Elasticsearch version 5.0, you can only create mapping of a geo_point field as shown in the following:

"location":{ 
      "type": "geo_point" 
    }  

Some more changes

The following are some very important additional changes you should be aware about:

  • Removal of site plugins. The support of site plugins has been completely removed from Elasticsearch 5.0.

  • Node clients are completely removed from Elasticsearch as they are considered really bad from a security perspective.

  • Every Elasticsearch node, by default, binds to the localhost and if you change the bind address to some non-localhost IP address, Elasticsearch considers the node as production-ready and applies various Bootstrap checks when the Elasticsearch node starts. This is done to prevent your cluster from being blown away in future if you forget to allocate enough resources to Elasticsearch. The following are some of the Bootstrap checks Elasticsearch applies: maximum number of file descriptors check, maximum map count check, and heap size check. Please go to this URL to ensure that you have set all the parameters for Bootstrap checks to be passed https://www.elastic.co/guide/en/elasticsearch/reference/master/bootstrap-checks.html.

    Note

    Please note that if you are using OpenVZ virtualization on your servers, then you may find it difficult in setting the maximum map count for running Elasticsearch in the production mode, as this virtualization does not easily allow you to edit the kernel parameters. So you should either speak to your sysadmin to configure vm.max_map_count correctly, or move to a platform where you can set it, for example kvm VPS.

  • _optimize endpoint which was deprecated in 2.x is finally removed and has been replaced by the Force Merge API. For example, an optimize request in version 1.x...

curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'

...should be converted to:


     curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
    

In addition to these changes, some major changes have been done in search, settings, allocation, merge, and scripting modules, along with cat and Java APIs, which we will cover in subsequent chapters.