Advanced Elasticsearch 7.0

By: Wai Tak Wong

Overview of this book

Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionalities of Elasticsearch and understand how you can develop a sophisticated, real-time search engine confidently. In addition to this, you'll also learn to run machine learning jobs in Elasticsearch to speed up routine tasks. You'll get started by learning to use Elasticsearch features on Hadoop and Spark and make search results faster, thereby improving the speed of query results and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash with examples to collect, parse, and enrich logs before indexing them in Elasticsearch. By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.

API conventions

We will only discuss some of the major conventions. For others, please refer to the Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). The following list can be applied throughout the REST API:

  • Access across multiple indices: This convention cannot be used in single document APIs:
    • _all: For all indices
    • comma: A separator between two indices
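    • wildcard (*,-): The asterisk character, *, matches any sequence of characters in the index name. A minus sign, -, placed before an index name or pattern excludes the matching indices. For example, test*,-test3 selects every index whose name starts with test, except test3.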
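
To make the multi-index syntax concrete, here is a minimal sketch that imitates how such an expression could be resolved against a list of index names. It is not Elasticsearch's own resolver: the resolve_indices function and the sample index names are hypothetical, and the sketch simply treats every leading - as an exclusion.

```python
from fnmatch import fnmatchcase


def resolve_indices(expression, existing):
    """Resolve a comma-separated index expression against known index names.

    Simplified illustration only: parts are applied from left to right,
    and a leading "-" removes matches from what has been selected so far.
    """
    selected = []
    for part in expression.split(","):
        exclude = part.startswith("-")
        pattern = part[1:] if exclude else part
        if pattern == "_all":
            pattern = "*"  # _all stands for every index
        matches = [name for name in existing if fnmatchcase(name, pattern)]
        if exclude:
            selected = [name for name in selected if name not in matches]
        else:
            selected.extend(name for name in matches if name not in selected)
    return selected


# Hypothetical index names, used only for this example
indices = ["logs-2019.01.01", "logs-2019.01.02", "metrics-2019.01.01"]
print(resolve_indices("logs-*,-logs-2019.01.01", indices))
```

With the sample names above, the expression keeps logs-2019.01.02 only, because the wildcard selects both logs indices and the exclusion then removes the first one.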
  • Common options:
    • Boolean values: Only false and true are accepted, in both request parameters and the JSON body; any other value raises an error.
    • Number values: A number can be supplied as a string as well as in the native JSON number type.
    • Time unit for duration: The supported time units are d for days, h for hours, m for minutes, s for seconds, ms for milliseconds, micros for microseconds, and nanos for nanoseconds.
    • Byte size unit: The supported data units are b for bytes, kb for kilobytes, mb for megabytes, gb for gigabytes, tb for terabytes, and pb for petabytes.
    • Distance unit: The supported distance units are mi for miles, yd for yards, ft for feet, in for inches, km for kilometers, m for meters, cm for centimeters, mm for millimeters, and nmi or NM for nautical miles.
    • Unit-less quantities: If the value specified is large enough, we can use a quantity as a multiplier. The supported quantities are k for kilo, m for mega, g for giga, t for tera, and p for peta. For instance, 10m represents the value 10,000,000.
    • Human-readable output: Values can be converted to human-readable values, such as 1h for 1 hour (3,600,000 milliseconds) and 1kb for 1,024 bytes. This option can be turned on by adding ?human=true to the query string. The default value is false.
    • Pretty result: If you append ?pretty=true to the request URL, the JSON string in the response will be in pretty format.
    • REST parameters: Follow the convention of using underscore delimiting.
    • Content type: The type of content in the request body must be specified in the request header using the Content-Type key name. Check the reference as to whether the content type you use is supported. In all our request examples that carry a body, application/json is used.
    • Request body in query string: If the client library does not accept a request body for non-POST requests, you can use the source query string parameter to pass the request body and specify the source_content_type parameter with a supported media type.
    • Stack traces: If the error_trace=true request URL parameter is set, the error stack trace will be included in the response when an exception is raised.
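
The unit suffixes and query string options above can be illustrated with a short sketch. The helper names to_base and with_options are hypothetical, the http://localhost:9200 address is an assumed local node, and the lookup tables only mirror the units listed in this section (with kb taken as 1,024 bytes, and the other byte units scaled by the same factor).

```python
import re
from urllib.parse import urlencode

# Multipliers for the suffixes listed above
BYTES = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3,
         "tb": 1024**4, "pb": 1024**5}
NANOS = {"nanos": 1, "micros": 10**3, "ms": 10**6, "s": 10**9,
         "m": 60 * 10**9, "h": 3600 * 10**9, "d": 86400 * 10**9}
QUANTITY = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12, "p": 10**15}


def to_base(value, table):
    """Convert a value such as '1kb' or '10m' to its base unit."""
    match = re.fullmatch(r"(\d+)([a-z]+)", value)
    if match is None or match.group(2) not in table:
        raise ValueError(f"unsupported value: {value!r}")
    return int(match.group(1)) * table[match.group(2)]


def with_options(url, **options):
    """Append common options such as human, pretty, or error_trace to a URL."""
    flags = {name: str(flag).lower() for name, flag in options.items()}
    return url + "?" + urlencode(flags)


print(to_base("1kb", BYTES))      # bytes
print(to_base("1h", NANOS))       # nanoseconds
print(to_base("10m", QUANTITY))   # unit-less quantity
print(with_options("http://localhost:9200/_cluster/stats",
                   human=True, pretty=True))
```

Note that the same suffix m means minutes in a duration and mega in a unit-less quantity, which is why the sketch keeps a separate table per kind of value.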
  • Date math in a formatted date value: In range queries or in date range aggregations, you can format date fields using date math:
    • The date math expressions start with an anchor date (now, or a date string ending with a double vertical bar: ||), followed by one or more sub-expressions such as +1h, -1d, or /d.
    • The supported time units differ from the time units for duration listed under Common options: y is for years, M for months, w for weeks, d for days, h or H for hours, m for minutes, and s for seconds. The operators are + for addition, - for subtraction, and / for rounding down to the nearest time unit. For example, /d rounds down to the start of the day.
To illustrate these date math expressions, assume that the current system time, now, is 2019.01.03 01:20:00. Then now+1h is 2019.01.03 02:20:00, now-1d is 2019.01.02 01:20:00, now/d is 2019.01.03 00:00:00, now/M is 2019.01.01 00:00:00, 2019.01.03 01:20:00||+1h is 2019.01.03 02:20:00, and so forth.
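
The arithmetic behind these expressions can be sketched in a few lines. The following is only an illustration, not Elasticsearch's parser: the evaluate, shift, and round_down names are hypothetical, and the anchor date is assumed to use the yyyy.MM.dd HH:mm:ss notation of this section rather than a field's configured date format.

```python
import calendar
import re
from datetime import datetime, timedelta

FIXED = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
         "m": "minutes", "s": "seconds"}
TOKEN = re.compile(r"([+-])(\d+)([yMwdhHms])|/([yMwdhHms])")


def shift(moment, amount, unit):
    """Add a signed number of date math units to a datetime."""
    if unit in FIXED:
        return moment + timedelta(**{FIXED[unit]: amount})
    months = amount * 12 if unit == "y" else amount
    year, month0 = divmod(moment.year * 12 + moment.month - 1 + months, 12)
    # Clamp the day so that, for example, Jan 31 plus one month stays valid
    day = min(moment.day, calendar.monthrange(year, month0 + 1)[1])
    return moment.replace(year=year, month=month0 + 1, day=day)


def round_down(moment, unit):
    """Round a datetime down to the start of the given unit."""
    if unit == "w":  # this sketch treats Monday as the start of the week
        moment = moment - timedelta(days=moment.weekday())
        unit = "d"
    fields = ["month", "day", "hour", "minute", "second", "microsecond"]
    keep = {"y": 0, "M": 1, "d": 2, "h": 3, "H": 3, "m": 4, "s": 5}[unit]
    reset = {name: (1 if name in ("month", "day") else 0)
             for name in fields[keep:]}
    return moment.replace(**reset)


def evaluate(expression, now):
    """Evaluate 'now...' or '<date>||...' date math against a given now."""
    if expression.startswith("now"):
        moment, rest = now, expression[3:]
    else:
        anchor, _, rest = expression.partition("||")
        moment = datetime.strptime(anchor, "%Y.%m.%d %H:%M:%S")
    position = 0
    while position < len(rest):
        match = TOKEN.match(rest, position)
        if match is None:
            raise ValueError(f"cannot parse {rest[position:]!r}")
        sign, amount, unit, round_unit = match.groups()
        if round_unit:
            moment = round_down(moment, round_unit)
        else:
            delta = int(amount) if sign == "+" else -int(amount)
            moment = shift(moment, delta, unit)
        position = match.end()
    return moment


now = datetime(2019, 1, 3, 1, 20, 0)
for text in ("now+1h", "now-1d", "now/d", "now/M", "2019.01.03 01:20:00||+1h"):
    print(text, "->", evaluate(text, now))
```

Sub-expressions are applied from left to right, so now-1d/d first moves back one day and then rounds down to the start of that day.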
  • Date math in index name: If you want to index time series data, such as logs, you can use a pattern with different date fields as the index names to manage daily logging information. Date math then gives you a way to search through a series of time series indices. The date math syntax for the index name is as follows:
<static_name{date_math_expr{date_format|time_zone}}>

The following are the terms used in the preceding syntax:

    • static_name: The unchanged text portion of the index name.
    • date_math_expr: The date math expression that is evaluated to produce the varying portion of the index name.
    • date_format: The format used to render the computed date. The default value is YYYY.MM.dd, where YYYY stands for the year, MM for the month, and dd for the day.
    • time_zone: The time zone offset from UTC. The default time zone is UTC. For instance, the offset for PST is -08:00.
      Given that the current system time is 1:00 PM UTC, January 3, 2019, the date math index name <logstash-{now/d{YYYY.MM.dd|+12:00}}> resolves to logstash-2019.01.04. Here, now/d means the current system time rounded down to the nearest day, and the +12:00 offset moves the local time to 1:00 AM on January 4.
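
As a rough sketch of how a date math index name resolves, the following code handles only the now and now/d expressions with a format built from YYYY, MM, and dd. The resolve_index_name function is hypothetical and far less general than the real syntax. The last line shows the percent-encoding that the special characters need when such a name is placed in a request URI.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import quote

PATTERN = re.compile(
    r"<(?P<static>[^{]*)\{(?P<expr>[^{}]+)"
    r"(?:\{(?P<fmt>[^|}]+)(?:\|(?P<tz>[+-]\d{2}:\d{2}))?\})?\}>"
)


def resolve_index_name(name, now_utc):
    """Resolve <static_name{date_math_expr{date_format|time_zone}}>."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a date math index name: {name!r}")
    if match["expr"] not in ("now", "now/d"):
        raise ValueError("this sketch only handles now and now/d")
    # Apply the time zone offset (UTC when omitted) before formatting
    offset = match["tz"] or "+00:00"
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6]))
    local = now_utc + sign * delta
    # Translate the date format tokens into strftime directives
    fmt = match["fmt"] or "YYYY.MM.dd"
    fmt = fmt.replace("YYYY", "%Y").replace("MM", "%m").replace("dd", "%d")
    return match["static"] + local.strftime(fmt)


now_utc = datetime(2019, 1, 3, 13, 0)  # 1:00 PM UTC, January 3, 2019
print(resolve_index_name("<logstash-{now/d{YYYY.MM.dd|+12:00}}>", now_utc))
print(resolve_index_name("<logstash-{now/d}>", now_utc))

# Percent-encode the special characters before using the name in a URI
print(quote("<logstash-{now/d}>", safe=""))
```

Because MM and dd are two-digit fields, the month and day in the resolved name are zero-padded.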
  • URL-based access control: Many APIs in Elasticsearch, such as multi-search, multi-get, and bulk, allow you to specify the index in the request body. By default, the index specified in the request body overrides the index parameter specified in the URL. If you use a proxy with URL-based access control to protect access to Elasticsearch indices, you can add the following setting to the elasticsearch.yml configuration file so that the index given in the URL can no longer be overridden:
rest.action.multi.allow_explicit_index: false
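
To show what specifying the index in the request body means for a bulk request, here is a minimal sketch that builds the newline-delimited JSON body with and without an explicit _index. The build_bulk_body helper, the document IDs, and the logs-2019.01.03 index name are hypothetical; with the preceding setting in place, only the second form is meant to be used, and the index comes from the URL instead.

```python
import json


def build_bulk_body(documents, explicit_index=None):
    """Build a bulk request body from (id, source) pairs.

    When explicit_index is given, each action line names its own index,
    which is what URL-based access control is meant to prevent.
    """
    lines = []
    for doc_id, source in documents:
        action = {"_id": doc_id}
        if explicit_index is not None:
            action["_index"] = explicit_index
        lines.append(json.dumps({"index": action}))  # action line
        lines.append(json.dumps(source))             # document line
    # The bulk body must end with a newline character
    return "\n".join(lines) + "\n"


docs = [("1", {"message": "first"}), ("2", {"message": "second"})]

# Index named in the body, for example sent to /_bulk
print(build_bulk_body(docs, explicit_index="logs-2019.01.03"))

# Index taken from the URL, for example sent to /logs-2019.01.03/_bulk
print(build_bulk_body(docs))
```

Either body would be sent with the application/x-ndjson content type.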

For other concerns or detailed usage, check out the official Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). In the next section, we will review the new features in version 7.0.0.