Advanced Elasticsearch 7.0

By: Wai Tak Wong

Overview of this book

Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionalities of Elasticsearch and understand how you can develop a sophisticated, real-time search engine confidently. In addition to this, you'll also learn to run machine learning jobs in Elasticsearch to speed up routine tasks. You'll get started by learning to use Elasticsearch features on Hadoop and Spark and make search results faster, thereby improving the speed of query results and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash with examples to collect, parse, and enrich logs before indexing them in Elasticsearch. By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.

Breaking changes

Aggregations changes

The changes related to aggregation are as follows:

  • The execution hints global_ordinals_hash and global_ordinals_low_cardinality for the terms aggregation are eliminated.
  • The maximum number of buckets allowed in a single response for bucket aggregations is controlled by the search.max_buckets cluster setting, which defaults to 10,000. A request that tries to return more buckets than the limit fails with an exception.
  • In the sources parameter of the composite aggregation, use the missing_bucket option instead of missing to include documents that have no value in the response. The deprecated missing option is eliminated.
  • In the scripted metric aggregation, the params._agg and params._aggs script parameters should be replaced by the state and states variables of the new ScriptContext.
  • In previous versions, map_script was the only required parameter of the scripted metric aggregation. Now, the combine_script and reduce_script parameters are also required.
  • The percentiles and percentile_ranks aggregations return null instead of NaN if their input is empty.
  • The stats and extended_stats aggregations return 0 instead of null if their input is empty.
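
The scripted metric changes above can be illustrated with a request body. The following is a minimal sketch, written as a Python dictionary that can be serialized to JSON. The field name amount and the aggregation name total_amount are hypothetical, and the script values are Painless source strings, not Python code.

```python
import json

# Hypothetical scripted metric aggregation that sums a numeric field.
# Every script uses the state/states variables instead of params._agg/_aggs.
scripted_metric = {
    "init_script": "state.values = []",
    "map_script": "state.values.add(doc['amount'].value)",
    # combine_script runs once per shard and is now required.
    "combine_script": "double sum = 0; for (v in state.values) { sum += v } return sum",
    # reduce_script merges the per-shard results and is now required.
    "reduce_script": "double total = 0; for (s in states) { total += s } return total",
}

request_body = {
    "size": 0,
    "aggs": {"total_amount": {"scripted_metric": scripted_metric}},
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be sent as the body of a search request against an index that has a numeric amount field.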

Analysis changes

The changes related to analysis are as follows:

  • The max limit for tokens that can be obtained in the _analyze API is 10000.
  • The max limit for input characters analyzed during highlighting is 1000000.
  • Use the name delimited_payload for the delimited payload token filter instead of the deprecated delimited_payload_filter. For existing pre-7.0 indices, a deprecation warning is logged. Creating a new index that uses the deprecated name fails with an exception.
  • The standard filter is eliminated.
  • The standard_html_strip analyzer is deprecated.
  • Using the deprecated nGram and edgeNGram token filter names will throw an error. Use the names ngram and edge_ngram, respectively, instead.

API changes

The changes related to APIs are as follows:

  • The internal versioning support for optimistic concurrency control is eliminated.
  • In the document bulk API, use the retry_on_conflict parameter instead of _retry_on_conflict; use routing instead of _routing; use version instead of _version; and use version_type instead of _version_type. Use the join meta-field instead of the _parent in mapping. All previous underscore parameters are eliminated. The camel-case parameters such as opType, versionType, and _versionType have been eliminated.
  • The cat thread pool API has renamed some field names from 6.x to 7.0 to align the meaning in the fixed thread pools and scaling thread pools. Use pool_size instead of the original size and core instead of the original min. For the corresponding alias, use psz instead of s, and cr instead of mi. In addition, the alias for max has changed from ma to mx. A new size field that represents the configured fixed number of active threads allowed in the current thread pool is introduced.
  • For the bulk request and update request, if a request contains an unknown parameter, a Bad Request (400) response will be returned.
  • Suggest statistics are no longer reported as part of the search statistics in the indices stats _stats API.
  • The copy_settings parameter in the split index operation will be removed in 8.0.0. These settings are copied by default during the operation.
  • Instead of using the stored search template _search API, you must use the stored script _scripts API to register search templates. The search template name must be provided.
  • Previously, the response status of the index alias API depended on whether the security feature was turned on or off. Now, an empty response with a status of OK (200) is always returned.
  • The response from creating a user with the /_xpack/security/user API no longer includes the additional created field outside the user field.
  • Use the corrected URL parameters _source_excludes and _source_includes instead of the original _source_exclude and _source_include parameters in the query.
  • Unknown keys in the multi search _msearch API were ignored before, but will fail with an exception now.
  • The graph /_graph/_explore API is eliminated.
  • Term vector can be used to return information and statistics in specific document fields in the document API. Use the corrected plural-form _termvectors method instead of the singular form, _termvector.
  • The Index Monitoring APIs are not authorized implicitly anymore. The privileges must be granted explicitly.
  • The deprecated fields parameter of the bulk request is eliminated.
  • If the document is missing when the PUT Document API is used with version number X, the error message differs from that of previous versions. The new message is as follows:
 document does not exist (expected version [X]).
  • The compressed_size and compressed_size_in_bytes fields are removed from the Cluster State API response.
  • The Migration Assistance API is removed.
  • When the cluster is configured as read-only, 200 status will be returned for a GET request.
  • The Clear Cache API previously supported both POST and GET requests. Support for GET requests is eliminated; use POST instead.
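
As an illustration of the renamed bulk parameters described above, here is a minimal sketch that builds the newline-delimited JSON for one bulk update action. The helper name bulk_update_lines, the index name my-index, and the document values are hypothetical. Only the parameter names routing and retry_on_conflict come from the list above.

```python
import json

def bulk_update_lines(index, doc_id, partial_doc, routing=None, retry_on_conflict=None):
    """Build the two NDJSON lines of a single bulk update action."""
    meta = {"_index": index, "_id": doc_id}
    if routing is not None:
        meta["routing"] = routing  # formerly _routing
    if retry_on_conflict is not None:
        meta["retry_on_conflict"] = retry_on_conflict  # formerly _retry_on_conflict
    action = json.dumps({"update": meta})
    source = json.dumps({"doc": partial_doc})
    # A bulk body is newline-delimited JSON and must end with a newline.
    return action + "\n" + source + "\n"

body = bulk_update_lines("my-index", "1", {"views": 10},
                         routing="user1", retry_on_conflict=3)
print(body, end="")
```

Because unknown parameters now produce a Bad Request (400) response, building the action metadata in one place like this makes it easier to keep the old underscore names out of requests.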

Cluster changes

The changes related to cluster are as follows:

  • The colon (:) is not a valid character for the cluster name anymore due to cross-cluster search support.
  • The default number of active shard copies (wait_for_active_shards) that must be ready before the open index API proceeds has been increased from 0 to 1.
  • The shard preferences in the search APIs, including _primary, _primary_first, _replica, and _replica_first, are eliminated.
  • The cluster-wide shard limit used to prevent user error now depends on the value of max_shards_per_node * number_of_nodes.
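
The cluster-wide shard limit above is simple arithmetic, sketched below. The function names are hypothetical, and the figures in the usage lines (1,000 shards per node, 3 nodes) are example inputs chosen for illustration, not values taken from the text.

```python
def cluster_shard_limit(max_shards_per_node, number_of_nodes):
    """Return the cluster-wide shard limit: max_shards_per_node * number_of_nodes."""
    return max_shards_per_node * number_of_nodes

def would_exceed_limit(open_shards, new_shards, max_shards_per_node, number_of_nodes):
    """Check whether adding new_shards would push the cluster past its limit."""
    limit = cluster_shard_limit(max_shards_per_node, number_of_nodes)
    return open_shards + new_shards > limit

# Example inputs (assumed): 3 nodes, 1,000 shards allowed per node.
print(cluster_shard_limit(1000, 3))
# An index with 5 primaries and 1 replica of each adds 10 shards.
print(would_exceed_limit(2995, 10, 1000, 3))
```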

Discovery changes

The changes related to Discovery are as follows:

  • The cluster.initial_master_nodes setting must be set before cluster bootstrapping is performed.
  • If half or more of the master-eligible nodes are going to be removed from a cluster, the affected nodes must first be excluded from the voting configuration using the _cluster/voting_config_exclusions API.
  • At least one of the following settings must be specified in the elasticsearch.yml configuration file:
    • discovery.seed_hosts
    • discovery.seed_providers
    • cluster.initial_master_nodes
    • discovery.zen.ping.unicast.hosts
    • discovery.zen.hosts_provider
  • Use the setting name cluster.no_master_block instead of discovery.zen.no_master_block, which is deprecated.
  • The default timeout for heartbeat fault detection ping operation between cluster nodes is 10 seconds instead of 30 seconds.

High-level REST client changes

The changes related to the high-level REST client are as follows:

  • Methods that accept headers as a Header varargs argument have been eliminated from the RestHighLevelClient class.
  • Previously, the cluster health API reported at the shard level by default. Now, it reports at the cluster level by default.

Low-level REST client changes

The changes related to low-level REST client are as follows:

  • The maxRetryTimeout setting of the RestClient and RestClientBuilder classes is eliminated.
  • Methods that do not take Request objects, such as performRequest and performRequestAsync, have been eliminated from the RestClient class.
  • The setHosts method is removed from the RestClient class.
  • The minimum compiler version is bumped to JDK 8.

Indices changes

The changes related to indices are as follows:

  • By default, each index in Elasticsearch is now allocated 1 primary shard and 1 replica; the previous default was 5 primary shards.
  • The colon (:) is no longer a valid character in an index name due to cross-cluster search support.
  • Negative values for the index.unassigned.node_left.delayed_timeout setting are treated as zero.
  • The undocumented side effects from a _flush or a _force_merge operation have been fixed.
  • The difference between max_gram and min_gram in NGramTokenFilter and NGramTokenizer is limited to 1 by default. This default limit can be changed with the index.max_ngram_diff index setting. If the limit is exceeded, the request fails with an exception.
  • The difference between max_shingle_size and min_shingle_size in ShingleTokenFilter is limited to 3 by default. This default limit can be changed with the index.max_shingle_diff index setting. If the difference exceeds the limit, the request fails with an exception.
  • New indices created in version 7.0.0 have a default value for the index.number_of_routing_shards setting, which the split index API relies on when splitting the source index. This may change how documents are distributed across shards. To maintain exactly the same distribution as a pre-7.0.0 index, set index.number_of_routing_shards to the same value as index.number_of_shards at index creation time.
  • Background refreshing is disabled for search-idle shards. If you don't set the value of index.refresh_interval explicitly, no background refresh is performed on shards that are search idle.
  • The Clear Cache API allows you to clear all caches, or just specific caches. The original usage of the specific cache name is eliminated. Use query instead of query_cache or filter_cache. Use request instead of request_cache. Use fielddata instead of field_data.
  • The network.breaker.inflight_requests.overhead setting has changed from 1 to 2. The estimated memory usage limit of all currently active incoming requests at transport or HTTP level on a node has been increased.
  • The parent circuit breaker defines a new setting indices.breaker.total.use_real_memory. The starting limit for the overall parent breaker indices.breaker.total.limit is 95% of the JVM heap if it is true (default), otherwise it is 70%.
  • The field data limit for the circuit breaker of index indices.breaker.fielddata.limit has been reduced from 60% to 40% of the maximum JVM heap by default.
  • The fix option of the index.shard.check_on_startup index setting, which checks shards for corruption, has been eliminated.
  • The elasticsearch-translog tool has been eliminated. Use the elasticsearch-shard tool instead.
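
To show how the n-gram difference limit interacts with index creation, here is a minimal sketch that builds an index-creation body in which index.max_ngram_diff covers the range used by an ngram tokenizer. The helper name ngram_index_body and the names my_ngram_tokenizer and my_ngram_analyzer are hypothetical.

```python
import json

def ngram_index_body(min_gram, max_gram):
    """Build an index-creation body whose max_ngram_diff covers the tokenizer range."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    return {
        "settings": {
            # Allow exactly the difference that the tokenizer below needs.
            "index": {"max_ngram_diff": max_gram - min_gram},
            "analysis": {
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "my_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_ngram_tokenizer",
                    }
                },
            },
        }
    }

print(json.dumps(ngram_index_body(2, 4), indent=2))
```

Deriving the setting from the tokenizer parameters keeps the two values from drifting apart, which is what would otherwise trigger the exception described above.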

Java API changes

The changes related to the Java API are as follows:

  • Use the isShardsAcknowledged() method instead of the isShardsAcked() method in the CreateIndexResponse, RolloverResponse, and CreateIndexClusterStateUpdateResponse classes. The isShardsAcked() method is eliminated.
  • Some classes of the aggregation framework have moved up a package level. Classes in the org.elasticsearch.search.aggregations.metrics.* packages are now in the org.elasticsearch.search.aggregations.metrics package. Classes in the org.elasticsearch.search.aggregations.pipeline.* packages are now in the org.elasticsearch.search.aggregations.pipeline package. The org.elasticsearch.search.aggregations.pipeline.PipelineAggregationBuilders class is now in the org.elasticsearch.search.aggregations package.
  • In the org.elasticsearch.action.bulk.Retry class, the withBackoff() methods that take a Settings argument are eliminated.
  • Regarding the Java client class, use the method name of the plural form, termVectors(), instead of the singular form, termVector().
  • The prepareExecute() method has also been eliminated.
  • The deprecated constructor AbstractLifecycleComponent(Settings settings) is eliminated.

Mapping changes

The changes related to mapping are as follows:

  • The original indexing meta field, _all, which indexed the values of all fields, has been eliminated.
  • The original indexing meta field, _uid, which combined _type and _id, has been eliminated.
  • The original default mapping meta field, _default_, which was used as the base mapping for any new mapping type, has been eliminated.
  • For search and highlighting purposes, the index_options parameter controls which information is added to the inverted index. It is no longer supported on numeric fields.
  • The max limit of nested JSON objects within a single document across all nested fields is 10000.
  • In the past, setting the update_all_types parameter when updating mappings would update all fields with the same name across every _type in the same index. The parameter has been eliminated.
  • The classic similarity feature, which is based on the TF/IDF to define how matching documents are scored, has been eliminated since it is no longer supported by Lucene.
  • A request that provides unknown similarity parameters fails with an exception.
  • The indexing strategy of the geo_shape datatype now defaults to a vector-indexing approach based on Lucene's new LatLonShape field type.
  • Most options of the geo_shape mapping will be eliminated in a future version. They are tree, precision, tree_levels, strategy, distance_error_pct, and points_only.
  • The maximum number of contexts in a completion field is 10. A deprecation warning is logged if the limit is exceeded.
  • The default value of include_type_name has changed from true to false.
If you use tree as a mapping option for geo_shape mapping and also use a timed index created from a template, you must explicitly set tree to geohash or quadtree to ensure compatibility with your previously created indices.

ML changes

The change related to machine learning is as follows:

  • The types parameter is eliminated from the datafeed configuration.

Packaging changes

The changes related to packaging are as follows:

  • If you use the rpm or deb package, overrides to the settings of the systemd elasticsearch service should be made in /etc/systemd/system/elasticsearch.service.d/override.conf.
  • The tar package no longer includes the Windows-specific files in the bin directory.
  • Ubuntu 14.04 is no longer supported.
  • Entering secrets as command-line input is no longer supported.

Search changes

The changes related to search are as follows:

  • Adaptive replica selection (cluster.routing.use_adaptive_replica_selection) is enabled by default, so search requests are routed to the shard copies that are expected to respond fastest. You may disable it to use the old round-robin method as in 6.x.
  • In the following error situations, a bad request (400) will be returned instead of an internal server error (500):
    • The resulting window is too large, from + size must be less than or equal to: [x] but was [y].
    • Cannot use the [sort] option in conjunction with [rescore].
    • The rescore window, [x], is too large.
    • The number of slices, [x], is too large.
    • Keep alive for scroll, [x], is too large.
    • In adjacency matrix aggregation, the number of filters exceeds the max limit.
    • An org.elasticsearch.script.ScriptException compile error.
  • The request_cache setting in the scroll search is eliminated. A bad request (400) will be returned if you still use it.
  • The method of including a rescore clause on a query to create a scroll search is eliminated. A bad request (400) will be returned if you still use it.
  • Use the corrected name, levenshtein, instead of levenstein, and jaro_winkler instead of jarowinkler, for the string_distance term suggest options in the term suggester.
  • With suggest_mode=popular, the term and phrase suggesters now use the doc frequency of the input terms to compute the frequency threshold for candidate suggestions.
  • Search requests that contain extra tokens after the main object will fail with a parsing exception.
  • The completion suggester provides an auto-complete/search-as-you-type feature. When indexing and querying a context-enabled completion field, you must provide a context.
  • The semantics of max_concurrent_shard_requests have changed from cluster level to node level. The default number of concurrent shard requests per node is 5.
  • The format of the total number of documents that matches the search criteria in the response has changed from a value type to an object of a value and a relation.
  • When track_total_hits is set to false in the search request, the total number of matching documents (hits.total) in the response will return null instead of -1. You may set the option as rest_total_hits_as_int=true in the request to return to the old format.
  • The track_total_hits defaults to 10,000 documents in the search response.
  • The default format for doc-value field is switched back to 6.x style. The Date field can take any date format and the Numeric fields can take a DecimalFormat pattern.
  • For geo context completion suggester, the context is only accepted if the path parameter points to a field with geo_point type.
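
Client code that reads hits.total has to cope with the new object format described above. The following is a minimal sketch of a helper that accepts both formats. The function name total_hits is hypothetical, and the sample responses are hand-written dictionaries rather than real server output.

```python
def total_hits(response):
    """Return (value, relation) for hits.total in either response format."""
    total = response["hits"].get("total")
    if total is None:
        # track_total_hits=false: the total is not tracked.
        return None, None
    if isinstance(total, dict):
        # 7.0 format: an object holding a value and a relation.
        return total["value"], total["relation"]
    if total < 0:
        # Old integer format used -1 when the total was not tracked.
        return None, None
    # Old integer format, or rest_total_hits_as_int=true: an exact count.
    return total, "eq"

new_style = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
old_style = {"hits": {"total": 42, "hits": []}}
untracked = {"hits": {"total": None, "hits": []}}

print(total_hits(new_style))
print(total_hits(old_style))
print(total_hits(untracked))
```

A relation of gte signals that the value is a lower bound, which is what a client should expect once the number of matches passes the track_total_hits threshold.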

Query DSL changes

The changes related to query DSL are as follows:

  • The default value of the transpositions parameter in a fuzzy query has changed from false to true.
  • The query string query options of use_dis_max, split_on_whitespace, all_fields, locale, auto_generate_phrase_queries, and lowercase_expanded_terms have all been eliminated.
  • If a bool query has only must_not clauses, a score of 0 is returned for all documents instead of 1 because scoring is ignored.
  • Geohashes are now treated as grid cells, instead of just points, when they are used to specify the edges in the geo_bounding_box query.
  • A multi-term query (a wildcard, fuzzy, prefix, range, or regexp query) against non-text fields with a custom analyzer will now throw an exception.
  • If the resulting polygon crosses the dateline, the GeoJSON standard will be applied to the geo_shape query to disambiguate the misleading results.
  • Boost settings are not allowed on complex inner span queries.
  • The number of terms in the terms query (index.max_terms_count) is limited to 65536.
  • The maximum length of a regex string (index.max_regex_length) allowed in a regexp query is limited to 1000.
  • No more than 1,024 fields can be queried at a time. This also limits the automatic expansion of fields in the query_string query, the simple_query_string query, and the multi_match query.
  • When a score cannot be tracked, max_score is returned as null instead of 0.
  • Boosting is a process that enhances document relevance. Previously, a matching document placed at the top of the results could be given a negative boost value to move it to the last position. Negative boosting support is eliminated.
  • The score generated by the script_score or field_value_factor function must be non-negative; otherwise, the request fails with an exception.
  • The difference between the query and filter context in QueryBuilders is eliminated. Therefore, bool queries with should clauses that don't require access to scores no longer have minimum_should_match set to 1 automatically.
  • Score values are subject to more constraints. A score must not be negative, must not decrease when the term frequency increases, and must not increase when the norm increases.
  • Support for negative values of the weight parameter in the function_score query is eliminated.
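
One practical consequence of the terms query limit listed above is that a long list of values has to be split across several queries. The following is a minimal sketch of such a split. The helper name terms_queries and the field name user_id are hypothetical, and the constant simply mirrors the limit of 65536 quoted for index.max_terms_count.

```python
MAX_TERMS_COUNT = 65536  # limit quoted above for index.max_terms_count

def terms_queries(field, values, limit=MAX_TERMS_COUNT):
    """Split values into terms queries that each stay within the limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    values = list(values)
    return [
        {"terms": {field: values[start:start + limit]}}
        for start in range(0, len(values), limit)
    ]

queries = terms_queries("user_id", range(150000))
print(len(queries))
print([len(q["terms"]["user_id"]) for q in queries])
```

Each resulting query can be run on its own and the results merged on the client side; whether that is preferable to raising the index setting depends on the workload.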

Settings changes

The changes related to settings are as follows:

  • The default node.name is the hostname instead of the first eight characters of the node _id.
  • Use the index.percolator.map_unmapped_fields_as_text setting instead of the deprecated index.percolator.map_unmapped_fields_as_string setting to force unmapped fields to be handled as strings in a percolate query.
  • Since the indexing thread pool no longer exists, the thread_pool.index.size and thread_pool.index.queue_size settings have been removed.
  • The thread_pool.bulk.size, thread_pool.bulk.queue_size, and es.thread_pool.write.use_bulk_as_display_name settings, which were supported as fallback settings, have been removed.
  • Use node.store.allow_mmap instead of node.store.allow_mmapfs to restrict the use of the mmapfs or the hybridfs store type of indices.
  • The HTTP on/off switch setting http.enabled has been eliminated.
  • The http.pipelining setting, which could be used to disable HTTP pipelining support, has been eliminated. However, the http.pipelining.max_events setting is still the same as in the previous version.
  • The search.remote.* settings used to configure cross-cluster search have been renamed to cluster.remote.*. The previous setting names are still accepted as fallbacks in version 7.0.0 and will be removed in version 8.0.0.
  • To audit local node information, use the following security settings instead of the deprecated ones. In addition, the default value of xpack.security.audit.logfile.emit_node_name has changed from true to false.
    • xpack.security.audit.logfile.emit_node_host_address instead of xpack.security.audit.logfile.prefix.emit_node_host_address
    • xpack.security.audit.logfile.emit_node_host_name instead of xpack.security.audit.logfile.prefix.emit_node_host_name
    • xpack.security.audit.logfile.emit_node_name instead of xpack.security.audit.logfile.prefix.emit_node_name
  • For all security realm settings, instead of using the explicit type setting, the realm type must be part of the setting name. Consider the following for instance:
xpack.security.authc.realms:
  realm1:
    type: ldap
    order: 0
    ...
  realm2:
    type: native
    ...

This must be updated as follows:

xpack.security.authc.realms:
  ldap.realm1:
    order: 0
    ...
  native.realm2:
    ...
  • The default TLS/SSL settings are removed.
  • TLS v1.0 is disabled by default.
  • Security is enabled only if xpack.security.enabled is true, or if xpack.security.enabled is not set and a gold or platinum license is installed.
  • Some notification settings have been removed in favor of their secure variants, which are defined in the Elasticsearch keystore. You must use the following:
    • xpack.notification.email.account.<id>.smtp.secure_password instead of xpack.notification.email.account.<id>.smtp.password
    • xpack.notification.hipchat.account.<id>.secure_auth_token instead of xpack.notification.hipchat.account.<id>.auth_token
    • xpack.notification.jira.account.<id>.secure_url instead of xpack.notification.jira.account.<id>.url
    • xpack.notification.jira.account.<id>.secure_user instead of xpack.notification.jira.account.<id>.user
    • xpack.notification.jira.account.<id>.secure_password instead of xpack.notification.jira.account.<id>.password
    • xpack.notification.pagerduty.account.<id>.secure_service_api_key instead of xpack.notification.pagerduty.account.<id>.service_api_key
    • xpack.notification.slack.account.<id>.secure_url instead of xpack.notification.slack.account.<id>.url
  • The settings under the xpack.security.audit.index and xpack.security.audit.outputs namespaces have been removed.
  • The ecs setting for the user agent ingest processor now defaults to true.
  • The action.master.force_local setting is removed.
  • The limit of cluster-wide shard number is now enforced, not optional.
  • If http.max_content_length is set to Integer.MAX_VALUE, it will not be reset to 100mb.
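
The realm settings change described earlier in this list is mechanical, so it can be sketched as a small transformation. This is an illustrative helper only: the function name migrate_realms is hypothetical, and it operates on a dictionary that is assumed to have been loaded from the xpack.security.authc.realms section of the configuration.

```python
def migrate_realms(old_realms):
    """Move each realm's type setting into the realm's setting name."""
    new_realms = {}
    for name, settings in old_realms.items():
        if "type" not in settings:
            raise ValueError(f"realm {name!r} has no type setting")
        # Keep every setting except the explicit type.
        remaining = {k: v for k, v in settings.items() if k != "type"}
        new_realms[f"{settings['type']}.{name}"] = remaining
    return new_realms

old = {
    "realm1": {"type": "ldap", "order": 0},
    "realm2": {"type": "native"},
}
print(migrate_realms(old))
```

The helper returns a new dictionary and leaves its input untouched, so the old and new layouts can be compared before anything is written back to a configuration file.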

Scripting changes

The changes related to scripting are as follows:

  • The getter methods for the date class have been eliminated. Use .value instead of .date on the date fields. For instance, use doc['start_time'].value.minuteOfHour instead of doc['start_time'].date.minuteOfHour.
  • Accessing a missing field of a document fails with an exception. To check whether a document is missing a value, you can use doc['field_name'].size() == 0.
  • A bad request (400) instead of an internal error (500) is returned for malformed scripts in search templates, ingest pipelines, and search requests.
  • The deprecated getValues() method of the ScriptDocValues class has been eliminated. Use doc["field_name"] instead of doc["field_name"].values.
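
The first two scripting changes above can be combined in one script. The following is a minimal sketch of a search request body, built as a Python dictionary, whose script field guards against a missing value before reading the date through .value. The script field name start_minute and the fallback parameter are hypothetical; start_time is the field used in the example above, and the source value is a Painless string, not Python code.

```python
import json

# Painless source: check for a missing value first, then read via .value.
source = (
    "doc['start_time'].size() == 0 "
    "? params.fallback "
    ": doc['start_time'].value.minuteOfHour"
)

request_body = {
    "query": {"match_all": {}},
    "script_fields": {
        "start_minute": {
            "script": {
                "lang": "painless",
                "source": source,
                # Hypothetical value returned when start_time is missing.
                "params": {"fallback": -1},
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```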

If an upgrade is needed, follow the advice on migrating between versions in the next section.