Advanced Elasticsearch 7.0

By Wai Tak Wong
Overview of this book

Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionality of Elasticsearch and understand how to develop a sophisticated, real-time search engine confidently. In addition, you'll learn to run machine learning jobs in Elasticsearch to speed up routine tasks. You'll get started by learning to use Elasticsearch features on Hadoop and Spark and to make searches faster, thereby improving query response times and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash, with examples, to collect, parse, and enrich logs before indexing them in Elasticsearch. By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.
Table of Contents (25 chapters)

  • Section 1: Fundamentals and Core APIs
  • Section 2: Data Modeling, Aggregations Framework, Pipeline, and Data Analytics
  • Section 3: Programming with the Elasticsearch Client
  • Section 4: Elastic Stack
  • Section 5: Advanced Features

Key concepts

In the previous section, we learned about some core concepts, such as clusters, nodes, shards, and replicas. In this section, we will briefly introduce the other key concepts, and then drill down into the details in subsequent chapters.

Mapping concepts across SQL and Elasticsearch

In the early stages of Elasticsearch, mapping types were a way to divide the documents in the same index into different logical groups, which meant that an index could have any number of types. In the past, it was common to compare an index in Elasticsearch to a database in SQL, and a mapping type in an index to a table in SQL. According to the official Elastic website (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html), the removal of mapping types was announced in the documentation of version 5.6. Later, in Elasticsearch 6.0.0, an index could contain only one mapping type, and multiple mapping types per index were completely removed in Elasticsearch 7.0.0. The main reason is that the analogy breaks down: tables are independent of each other in an SQL database, whereas in an Elasticsearch index, fields with the same name in different mapping types are backed internally by the same Lucene field, so they must have the same definition.
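As a concrete illustration, here is a minimal sketch of creating a typeless index in Elasticsearch 7.0, where the mapping is defined directly with no type name (the index name my_index and its single field are hypothetical, and a local node at localhost:9200 is assumed):

    # Elasticsearch 7.0: mappings are defined without a type name
    curl -X PUT "http://localhost:9200/my_index" \
      -H "Content-Type: application/json" \
      -d '{
        "mappings": {
          "properties": {
            "title": { "type": "text" }
          }
        }
      }'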

Let's take a look at the terminology in SQL and Elasticsearch in the following list (https://www.elastic.co/guide/en/elasticsearch/reference/master/_mapping_concepts_across_sql_and_elasticsearch.html), showing how the data is organized:

  • Column (SQL) / Field (Elasticsearch): A column is a set of data values of the same data type, with one value for each row of the table; Elasticsearch refers to this as a field. A field is the smallest unit of data in Elasticsearch, and it can contain a list of multiple values of the same type.
  • Row (SQL) / Document (Elasticsearch): A row represents a structured data item containing a series of data values from each column of the table. A document is like a row, in that it groups fields (columns in SQL). A document is a JSON object in Elasticsearch.
  • Table (SQL) / Index (Elasticsearch): A table consists of columns and rows. An index is the largest unit of data in Elasticsearch. Compared to a database in SQL, an index is a logical partition of the indexed documents and the target against which search queries are executed.
  • Schema (SQL) / implicit (Elasticsearch): In a relational database management system (RDBMS), a schema contains schema objects, which can be tables, columns, data types, views, and so on. A schema is typically owned by a database user. Elasticsearch does not provide an equivalent concept.
  • Catalog/database (SQL) / Cluster (Elasticsearch): In SQL, a catalog or database represents a set of schemas. In Elasticsearch, a cluster contains a set of indices.
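To make the row-to-document analogy concrete, here is a minimal sketch of storing what would be one SQL row as a JSON document (the products index, document ID, and fields are all illustrative):

    # A SQL row such as (id=1, name='coffee mug', price=7.50) in a
    # "products" table corresponds to a JSON document in a "products" index:
    curl -X PUT "http://localhost:9200/products/_doc/1" \
      -H "Content-Type: application/json" \
      -d '{"name": "coffee mug", "price": 7.50}'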

Mapping

A schema could mean an outline, diagram, or model, which is often used to describe the structure of different types of data. Elasticsearch is reputed to be schema-less, in contrast to traditional relational databases, where you must explicitly specify tables, fields, and field types. In Elasticsearch, schema-less simply means that a document can be indexed without specifying the schema in advance. Under the hood, though, when no explicit static mapping is specified, Elasticsearch dynamically derives a schema from the structure of the first document indexed and decides how to index its fields. Elasticsearch's term for a schema is a mapping, which defines how Lucene stores the indexed document and the fields it contains. When you add a new field to your document, the mapping is also automatically updated.
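A minimal sketch of this dynamic behavior, assuming a local node and a hypothetical visitors index that does not exist yet:

    # Index a document into a nonexistent index; Elasticsearch creates the
    # index and derives the mapping from the document's structure.
    curl -X PUT "http://localhost:9200/visitors/_doc/1" \
      -H "Content-Type: application/json" \
      -d '{"name": "John", "age": 32}'

    # Inspect the dynamically derived mapping; "name" is mapped as text
    # (with a keyword sub-field) and "age" as long.
    curl -X GET "http://localhost:9200/visitors/_mapping"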

Starting from Elasticsearch 6.0.0, only one mapping type is allowed per index. A mapping type consists of fields, each with a data type, and meta-fields. Elasticsearch supports many different data types for the fields in a document, and each document has meta-fields associated with it, whose behavior we can customize when creating a mapping type. We'll cover this in Chapter 4, Mapping APIs.
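Continuing the hypothetical visitors example above, fetching a document back shows some of the meta-fields that Elasticsearch associates with it:

    # The response carries meta-fields such as _index, _id, _version,
    # and _source alongside the indexed data.
    curl -X GET "http://localhost:9200/visitors/_doc/1"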

Analyzer

Elasticsearch comes with a variety of built-in analyzers that can be used in any index without further configuration. If the built-in analyzers are not suitable for your use case, you can create a custom analyzer. Whether it is a built-in or a custom analyzer, it is just a package of the following three lower-level building blocks (a sketch combining them follows the list):

  • Character filters: Receive the raw text as a stream of characters and can transform the stream by adding, removing, or changing characters
  • Tokenizer: Splits the stream of characters into a stream of tokens
  • Token filters: Receive the token stream and may add, remove, or change tokens
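Here is a hedged sketch of how the three blocks combine into a custom analyzer (the index and analyzer names are illustrative), using the built-in html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

    curl -X PUT "http://localhost:9200/my_analyzer_index" \
      -H "Content-Type: application/json" \
      -d '{
        "settings": {
          "analysis": {
            "analyzer": {
              "my_custom_analyzer": {
                "type": "custom",
                "char_filter": ["html_strip"],
                "tokenizer": "standard",
                "filter": ["lowercase", "stop"]
              }
            }
          }
        }
      }'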

The same analyzer should normally be used both at index time and at search time, but you can set search_analyzer in the field mapping to apply a different analyzer while searching.
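For instance, a field mapping along the following lines (a minimal sketch; the index and field names are hypothetical) uses the built-in standard analyzer at index time and the built-in simple analyzer at search time:

    curl -X PUT "http://localhost:9200/my_search_index" \
      -H "Content-Type: application/json" \
      -d '{
        "mappings": {
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "standard",
              "search_analyzer": "simple"
            }
          }
        }
      }'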

Standard analyzer

The standard analyzer is the default analyzer, which is used if none is specified. A standard analyzer consists of the following:

  • Character filter: None
  • Tokenizer: Standard tokenizer
  • Token filters: Lowercase token filter and stop token filter (disabled by default)

A standard tokenizer provides grammar-based tokenization. A lowercase token filter normalizes the token text to lowercase, while a stop token filter removes stop words from the token stream. For a list of English stop words, you can refer to https://www.ranks.nl/stopwords. Let's test the standard analyzer with the input text You'll love Elasticsearch 7.0.

Since it is a POST request, you need to set the Content-Type header to application/json. The URL is http://localhost:9200/_analyze, and the request body is a raw JSON string, {"text": "You'll love Elasticsearch 7.0."}. The response contains four tokens, you'll, love, elasticsearch, and 7.0, all in lowercase due to the lowercase token filter.
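An equivalent request, as a minimal sketch using curl against a local node:

    curl -X POST "http://localhost:9200/_analyze" \
      -H "Content-Type: application/json" \
      -d "{\"text\": \"You'll love Elasticsearch 7.0.\"}"

    # Expected tokens in the response: "you'll", "love", "elasticsearch", "7.0"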

In the next section, let's get familiar with the API conventions.