In the previous section, we learned some core concepts such as clusters, nodes, shards, replicas, and so on. We will briefly introduce the other key concepts in this section. Then, we'll drill down into the details in subsequent chapters.
Key concepts
Mapping concepts across SQL and Elasticsearch
In the early stages of Elasticsearch, mapping types were a way to divide the documents in the same index into different logical groups, which meant that an index could have any number of types. In the past, it was popular to compare an index in Elasticsearch to a database in SQL, and a mapping type in Elasticsearch to a table in SQL. According to the official Elastic website (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html), the removal of mapping types was first announced in the documentation of version 5.6. Later, in Elasticsearch 6.0.0, indices were restricted to a single mapping type. Mapping types were then completely removed in Elasticsearch 7.0.0. The main reason is that tables are independent of each other in an SQL database, whereas in an Elasticsearch index, fields with the same name in different mapping types are internally backed by the same Lucene field.
Let's take a look at the terminology of SQL and Elasticsearch in the following table (https://www.elastic.co/guide/en/elasticsearch/reference/master/_mapping_concepts_across_sql_and_elasticsearch.html), which shows how the data is organized:
| SQL | Elasticsearch | Description |
| --- | --- | --- |
| Column | Field | A column is a set of data values of the same data type, with one value for each row of the table. Elasticsearch refers to this as a field. A field is the smallest unit of data in Elasticsearch and can contain a list of multiple values of the same type. |
| Row | Document | A row represents a structured data item, which contains a series of data values from each column of the table. A document groups fields (columns in SQL) in the same way. A document is a JSON object in Elasticsearch. |
| Table | Index | A table consists of columns and rows. An index is the largest unit of data in Elasticsearch. Compared to a database in SQL, an index is a logical partition of the indexed documents and the target against which search queries are executed. |
| Schema | Implicit | In a relational database management system (RDBMS), a schema contains schema objects, which can be tables, columns, data types, views, and so on. A schema is typically owned by a database user. Elasticsearch does not provide an equivalent concept. |
| Catalog/database | Cluster | In SQL, a catalog or database represents a set of schemas. In Elasticsearch, a cluster contains a set of indices. |
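To make the analogy concrete, here is a minimal sketch of indexing a document, which plays the role of inserting a row into a table (the customers index name and the field values are made up for illustration):

```
# Roughly the Elasticsearch equivalent of the SQL statement:
#   INSERT INTO customers (name, age) VALUES ('John Doe', 30);
# The document (row) is indexed into the customers index (table),
# and its fields (columns) are name and age.
curl -X PUT "localhost:9200/customers/_doc/1" -H 'Content-Type: application/json' -d'
{
  "name": "John Doe",
  "age": 30
}'
```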
Mapping
A schema could mean an outline, diagram, or model, and is often used to describe the structure of different types of data. Elasticsearch is reputed to be schema-less, in contrast to traditional relational databases, where you must explicitly specify tables, fields, and field types. In Elasticsearch, schema-less simply means that documents can be indexed without specifying a schema in advance. Under the hood, though, when no explicit static mapping is specified, Elasticsearch dynamically derives a schema from the structure of the first indexed document and decides how to index its fields. Elasticsearch's term for a schema is mapping, which is the process of defining how Lucene stores the indexed documents and the fields they contain. When you add a new field to your document, the mapping is also automatically updated.
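As a quick sketch of this behavior (the products index name and the document's field values are made up), indexing a document into an index that does not yet exist triggers dynamic mapping, and the derived mapping can then be inspected:

```
# Index a document without defining any mapping in advance.
curl -X PUT "localhost:9200/products/_doc/1" -H 'Content-Type: application/json' -d'
{
  "title": "Wonderful gadget",
  "price": 9.99,
  "in_stock": true
}'

# Inspect the mapping Elasticsearch derived dynamically: "title" is
# mapped as text (with a keyword sub-field), "price" as float, and
# "in_stock" as boolean.
curl -X GET "localhost:9200/products/_mapping"
```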
Starting from Elasticsearch 6.0.0, only one mapping type is allowed for each index. A mapping type has fields defined by data types, plus meta-fields. Elasticsearch supports many different data types for the fields of a document. Each document also has meta-fields associated with it, and we can customize their behavior when creating a mapping type. We'll cover this in Chapter 4, Mapping APIs.
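As a sketch of an explicit (static) mapping, the following creates an index whose single mapping type declares the data type of each field up front (the books index name and its fields are illustrative):

```
# Create an index with an explicit mapping so that field data types
# are declared in advance instead of being derived dynamically.
curl -X PUT "localhost:9200/books" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title":        { "type": "text" },
      "isbn":         { "type": "keyword" },
      "published_on": { "type": "date" },
      "page_count":   { "type": "integer" }
    }
  }
}'
```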
Analyzer
Elasticsearch comes with a variety of built-in analyzers that can be used in any index without further configuration. If the built-in analyzers are not suitable for your use case, you can create a custom analyzer. Whether it is a built-in or a custom analyzer, it is just a package of the following three lower-level building blocks (a sketch of a custom analyzer combining them follows the list):
- Character filters: Receive the raw text as a stream of characters and can transform the stream by adding, removing, or changing characters
- Tokenizer: Splits the stream of characters into a stream of tokens
- Token filters: Receive the token stream and may add, remove, or change tokens
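Below is a minimal sketch of a custom analyzer assembled from the three building blocks; the index name my_index and the analyzer name my_analyzer are made up, while html_strip, standard, lowercase, and stop are built-in components:

```
# Define a custom analyzer as a package of the three building blocks:
# a character filter, a tokenizer, and token filters.
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'
```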
The same analyzer should normally be used both at index time and at search time, but you can set search_analyzer in the field mapping to perform a different analysis at search time.
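For example, here is a sketch of a field mapping that analyzes differently at index time and search time (the articles index and title field are illustrative; both analyzers used are built-in):

```
# The title field is analyzed with the standard analyzer at index
# time, but queries against it are analyzed with the simple analyzer.
curl -X PUT "localhost:9200/articles" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}'
```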
Standard analyzer
The standard analyzer is the default analyzer, which is used if none is specified. A standard analyzer consists of the following:
- Character filter: None
- Tokenizer: Standard tokenizer
- Token filters: Lowercase token filter and stop token filter (disabled by default)
A standard tokenizer provides grammar-based tokenization. A lowercase token filter normalizes the token text to lowercase, while a stop token filter removes stop words from the token stream. For a list of English stop words, you can refer to https://www.ranks.nl/stopwords. Let's test the standard analyzer with the input text You'll love Elasticsearch 7.0.
Since it is a POST request, you need to set the Content-Type header to application/json. The URL is http://localhost:9200/_analyze and the request body is a raw JSON string, {"text": "You'll love Elasticsearch 7.0."}. The response has four tokens: you'll, love, elasticsearch, and 7.0, all in lowercase due to the lowercase token filter.
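The same request issued with curl might look as follows (the response is abridged in the comment to just the token values):

```
# Analyze the text with the default (standard) analyzer; note the
# shell escaping of the apostrophe inside the single-quoted body.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "text": "You'\''ll love Elasticsearch 7.0."
}'

# Abridged response: four lowercase tokens.
#   "you'll", "love", "elasticsearch", "7.0"
```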
In the next section, let's get familiar with the API conventions.