Getting Started with Elastic Stack 8.0

By: Asjad Athick
Overview of this book

The Elastic Stack helps you work with massive volumes of data to power use cases in the search, observability, and security solution areas. This three-part book starts with an introduction to the Elastic Stack with high-level commentary on the solutions the stack can be leveraged for. The second section focuses on each core component, giving you a detailed understanding of the component and the role it plays. You’ll start by working with Elasticsearch to ingest, search, analyze, and store data for your use cases. Next, you’ll look at Logstash, Beats, and Elastic Agent as components that can collect, transform, and load data. Later chapters help you use Kibana as an interface to consume Elastic solutions and interact with data on Elasticsearch. The last section explores the three main use cases offered on top of the Elastic Stack. You’ll start with full-text search and look at real-world outcomes powered by search capabilities. You’ll then learn how the stack can be used to monitor and observe large and complex IT environments. Finally, you’ll understand how to detect, prevent, and respond to security threats across your environment. The book ends by highlighting architecture best practices for successful Elastic Stack deployments. By the end of this book, you’ll be able to implement the Elastic Stack and derive value from it.
Table of Contents (18 chapters)

Section 1: Core Components
Section 2: Working with the Elastic Stack
Section 3: Building Solutions with the Elastic Stack

Collecting and ingesting data

So far, we have looked at Elasticsearch, a scalable search and analytics engine for all kinds of data. We have also got Kibana to interface with Elasticsearch to help us explore and use our data effectively. The final capability to make it all work together is ingestion.

The Elastic Stack provides two products for ingestion, depending on your use cases.

Collecting data from across your environment using Beats

Useful data is generated all over the place in present-day environments, often from varying technology stacks, as well as legacy and new systems. As such, it makes sense to collect data directly from, or closer to, the source system and ship it into your centralized logging or analytics platform. This is where Beats come in; Beats are lightweight applications (also referred to as agents) that can collect and ship several types of data to destinations such as Elasticsearch, Logstash, or Kafka.
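To make this concrete, the following is a minimal sketch of a Filebeat configuration that tails a log file and ships it to Elasticsearch. The hostnames, file paths, and credentials are illustrative assumptions, not values from this book:

```yaml
# filebeat.yml — minimal sketch (assumed paths and endpoints)
filebeat.inputs:
  - type: filestream              # file-tailing input type in recent Filebeat versions
    id: nginx-access              # unique id required for filestream inputs
    paths:
      - /var/log/nginx/access.log

output.elasticsearch:
  hosts: ["https://elasticsearch.example.internal:9200"]   # assumed endpoint
  username: "beats_writer"        # assumed ingest user
  password: "${BEATS_PW}"         # resolved from an environment variable or keystore
```

The same agent could instead point its output at Logstash or Kafka; only the `output.*` section changes, while the data collection configuration stays the same.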

Elastic offers a few types of Beats today for various use cases:

  • Filebeat: Collecting log data
  • Metricbeat: Collecting metric data
  • Packetbeat: Decoding and collecting network packet metadata
  • Heartbeat: Collecting system/service uptime and latency data
  • Auditbeat: Collecting OS audit data
  • Winlogbeat: Collecting Windows event logs (application, security, and system)
  • Functionbeat: Running data collection on serverless compute infrastructure such as AWS Lambda

Beats use an open source library called libbeat that provides generic APIs for configuring inputs and destinations for data output. Beats implement the data collection functionality that's specific to the type of data (such as logs and metrics) that they collect. A range of community-developed Beats are available, in addition to the officially produced Beats agents.

Beats modules and the Elastic Common Schema

The modules that are available in Beats allow you to collect consistent, well-structured datasets, and enable the distribution of out-of-the-box dashboards, machine learning jobs, and alerts for users to leverage in their use cases.

Importance of a unified data model

One of the most important aspects of ingesting data into a centralized logging platform is paying attention to the data format in use. A Unified Data Model (UDM) is an especially useful tool to have, ensuring data can be easily consumed by end users once ingested into a logging platform. Enterprises typically follow a mixture of two approaches to ensure the log data complies with their unified data model:

  • Enforcing a logging standard or specification for log-producing applications in the company.

This approach is often considerably costly to implement, maintain, and scale. Changes in the log schema at the source can also have unintended downstream implications in other applications consuming the data. It is common to see UDMs evolve quite rapidly as the nature and the content of the logs that have been collected change. The use of different technology stacks or frameworks in an organization can also make it challenging to log with consistency and uniformity across the environment.

  • Transforming/renaming fields in incoming data using an ETL tool such as Logstash to comply with the UDM. Organizations can achieve relatively successful results using this approach, with considerably fewer upfront costs when reworking logging formats and schemas. However, the approach does come with some downsides:

(a) Parsers need to be maintained and constantly updated to make sure the logs are extracted and stored correctly.

(b) Most of the parsing work usually needs to be done (or overseen) by a central function in the organization (because of the centralized nature of the transformation), rather than by following a self-service or DevOps-style operating model.

Elastic Common Schema

The Elastic Common Schema (ECS) is a unified data model defined by Elastic. The ECS specification has a few advantages over a custom or internal UDM:

  • ECS sets Elasticsearch index mappings for fields. This is important so that metric aggregations and ranges can be applied properly to data. Numeric fields such as the number of bytes received as part of a network request should be mapped as an integer value. This allows a visualization to sum the total number of bytes received over a certain period. Similarly, the HTTP status code field needs to be mapped as a keyword so that a visualization can count how many 500 errors the application encountered.
  • Out-of-the-box content such as dashboards, visualizations, machine learning jobs, and alerts can be used if your data is ECS-compliant. Similarly, you can consume content from and share it with the open source community.
  • You can still add your own custom or internal fields to ECS by following the naming conventions that have been defined as part of the ECS specification. You do not have to use just the fields that are part of the ECS specification.
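As a sketch of what an ECS-compliant event looks like, the following JSON document uses field names from the public ECS specification (`@timestamp`, `event.outcome`, `http.response.status_code`, `url.path`, `source.ip`); the values themselves are made up for illustration:

```json
{
  "@timestamp": "2022-03-14T09:26:53.000Z",
  "event": { "category": ["web"], "outcome": "failure" },
  "http": { "response": { "status_code": 500, "body": { "bytes": 512 } } },
  "url": { "path": "/checkout" },
  "source": { "ip": "10.1.2.3" },
  "labels": { "team": "payments" }
}
```

Custom metadata that ECS does not define (the `labels.team` field above, for example) can be carried in the `labels` object or in your own namespaced fields, following the ECS naming conventions.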

Beats modules

Beats modules can automatically convert logs and metrics from various supported data sources into an ECS-compliant schema. Beats modules also ship with out-of-the-box dashboards, machine learning jobs, and alerts. This makes it incredibly easy to onboard a new data source onto Elasticsearch using a Beat and immediately consume this data as part of a value-adding use case in your organization. There is a growing list of supported Filebeat and Metricbeat modules available on the Elastic integration catalog.
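For example, the Nginx module for Filebeat can be switched on with `filebeat modules enable nginx`, after which its behavior is controlled by a small module configuration file. The following is a sketch of that file; the log path shown is an assumption and only needs to be set if your logs are not in the distribution's default location:

```yaml
# modules.d/nginx.yml — sketch of the Nginx Filebeat module configuration
- module: nginx
  access:
    enabled: true
    # Override only if logs are not in the default location (path below is assumed)
    var.paths: ["/var/log/nginx/access.log*"]
  error:
    enabled: true
```

Running `filebeat setup` then loads the module's out-of-the-box dashboards and index templates into the stack, so the parsed, ECS-compliant data is immediately usable in Kibana.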

Onboarding and managing data sources at scale

Managing a range of Beats agents can come with significant administrative overhead, especially in large and complex environments. Onboarding new data sources would require updating configuration files, which then need to be deployed on the right host machine.

Elastic Agent is a single, unified agent that can be used to collect logs, metrics, and uptime data from your environment. Elastic Agent orchestrates the core Beats agents under the hood but simplifies the deployment and configuration process, given teams now need to manage just one agent.

Elastic Agent can also work with a component called Fleet Server to simplify the ongoing management of the agents and the data they collect. Fleet can be used to centrally push policies to control data collection and manage agent version upgrades, without any additional administrative effort. We take a look at Elastic Agent in more detail in Chapter 9, Managing Data Onboarding with Elastic Agent.

Centralized extraction, transformation, and loading of your data with Logstash

While Beats make it very convenient to onboard a new data source, they are designed to be lightweight in terms of performance footprint. As such, Beats do not provide a great deal of heavy processing, transformation, and enrichment capabilities. This is where Logstash comes in to help your ingestion architecture.

Logstash is a general-purpose ETL tool designed to input data from any number of source systems/communication protocols. The data is then processed through a set of filters, where you can mutate, add, enrich, or remove fields as required. Finally, events can be sent to several destination systems. This configuration is defined as a Logstash parser. We will dive deeper into Logstash and how it can be used for various ETL use cases in Chapter 7, Using Logstash to Extract, Transform, and Load Data.
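The input/filter/output structure described above can be sketched as a Logstash pipeline configuration. This is an illustrative example, assuming events arrive from Beats and carry a combined-format web access log line in the `message` field; the Elasticsearch endpoint and index name are assumptions:

```conf
# Hypothetical Logstash pipeline: receive events from Beats, parse, enrich, load.
input {
  beats {
    port => 5044                    # default port Beats use to ship to Logstash
  }
}

filter {
  grok {
    # Extract structured fields from a combined-format access log line (assumed format)
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  geoip {
    source => "clientip"            # enrich with geo data derived from the client IP
  }
  mutate {
    remove_field => ["message"]     # drop the raw line once it has been parsed
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch.example.internal:9200"]   # assumed endpoint
    index => "weblogs-%{+YYYY.MM.dd}"                          # daily index naming
  }
}
```

Each block maps directly to a pipeline stage: any number of inputs feed events through the ordered list of filters, and the results fan out to one or more outputs.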

Deciding between using Beats and Logstash

Beats and Logstash are designed to serve specific requirements when collecting and ingesting data. Users are often unsure whether to use Beats or Logstash when onboarding a new data source. The following lists aim to make this clearer.

When to use Beats

Beats should be used when the following points apply to your use case:

  • When you need to collect data from a large number of hosts or systems from across your environment. Some examples are as follows:

(a) Collecting web logs from a dynamic group of hundreds of web servers

(b) Collecting logs from a large number of microservices running on a container orchestration platform such as Kubernetes

(c) Collecting metrics from a group of MySQL instances in cloud/on-premises locations

  • When there is a supported Beats module available.
  • When you do not need to perform a significant amount of transformation/processing before consuming data on Elasticsearch.
  • When consuming from a web source, where the scaling/throughput limitations of a single Beats instance are not a concern.

When to use Logstash

Logstash should be used when you have the following requirements:

  • When a large amount of data is consumed from a centralized location (such as a file share, AWS S3, Kafka, or AWS Kinesis) and you need to be able to scale ingestion throughput.
  • When you need to transform data considerably or parse complex schemas/codecs, especially using regular expressions or Grok.
  • When you need to be able to load balance ingestion across multiple Logstash instances.
  • When a supported Beats module is not available.

It is worth noting that Beats agents are continually updated and enhanced with every release. The gap between the capabilities of Logstash and Beats has closed considerably over the last few releases.

Using Beats and Logstash together

It is quite common for organizations to get the best of both worlds by leveraging Beats and Logstash together. This allows data to be collected from a range of sources while enabling centralized processing and transformation of events.
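In this combined pattern, each Beat ships to a pool of Logstash instances rather than directly to Elasticsearch, so the only change on the agent side is the output section of its configuration. A sketch, with assumed hostnames:

```yaml
# filebeat.yml fragment — pointing Filebeat at Logstash instead of Elasticsearch
output.logstash:
  hosts: ["logstash-1.example.internal:5044", "logstash-2.example.internal:5044"]
  loadbalance: true   # distribute events across the listed Logstash hosts
```

Logstash then performs the heavy parsing and enrichment centrally before loading the events into Elasticsearch.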

Now that we understand how we can ingest data into the Elastic Stack, let's look at the options that are available when running the stack.