Book Image

Scalable Data Streaming with Amazon Kinesis

By : Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti
Book Image

Scalable Data Streaming with Amazon Kinesis

By: Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Overview of this book

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. This data streaming service provides APIs and client SDKs that enable you to produce and consume data at scale. Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use case shown through the book to help you get started and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you’ll learn how other AWS services can be integrated into Kinesis. These services include Redshift, Dynamo Database, AWS S3, Elastic Search, and third-party applications such as Splunk. By the end of this AWS book, you’ll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).
Table of Contents (13 chapters)
1
Section 1: Introduction to Data Streaming and Amazon Kinesis
5
Section 2: Deep Dive into Kinesis
10
Section 3: Integrations

Overview of messaging concepts

In this section, we will review the concept of message brokers in a high-level, implementation-agnostic manner. First, we will go over the core components of all messaging systems and then we will review some key terminology and concepts related to their use.

Overview of core messaging components

There are four components in all messaging systems: producers, consumers, streams, and messages. The following diagram shows a logical breakdown of producers sending messages to a stream, the stream buffering them, and consumers receiving them:

Figure 1.3 – High-level view of messaging

Figure 1.3 – High-level view of messaging

Despite this design's relative simplicity, there is a substantial amount of configuration and optimization that is possible. Now, let's dive a little deeper into each component.

Streams

The stream is the system that stores the messages or records sent by the producers and retrieved by the consumers. They can be ordered in a First In First Out (FIFO) model. Messages in the stream that have been received, but not yet retrieved, are referred to as a backlog.

The retention period is the length of time that the records are accessible after they are added to the stream. This is the maximum size the backlog can be, and it is also the maximum time a new, slow, or intermittent consumer can access the records.

Messages (records)

A message consists of a payload and header information. The header information consists of information set by the producer, and it includes a unique identifier assigned by the message broker when it is inserted into the stream. In general, messages are relatively small, in the order of kilobytes, and messaging systems generally have a maximum payload size.

Producers

The producer is an application that is the source of data that will go into the message or record. It connects to the message broker and puts the data into the stream. There can often be multiple producers sending data to the same message broker.

Consumers

The consumer is an application that receives the messages that are sent by the producer. It connects to the message broker and retrieves the data from the stream. The responsibility for keeping track of the last read message, so that the consumer can retrieve the next message, can be handled either by the message broker (RabbitMQ or SQS) or by the consumer (Kinesis or Kafka). There can be multiple consumers for a message broker.

Real-time analytics

When thinking about real-time analytics, it can be useful to expand it from the producer, stream, consumer model to a five-stage model (Figure 1.3): 1) source of data; 2) data ingestion mechanisms; 3) stream storage; 4) real-time stream processing; and 5) destination, data sink, or action. This model helps us elevate our thinking from the structural communication level to the data processing level. For instance, filtering can be applied at every stage to reduce compute downstream.

The source of data refers to where the data is coming from. For example, it could be mobile devices, web clickstreams, log analytics, IoT devices, or smart devices. Once you have the source, the data needs to be ingested into the stream. This requires a solution that can capture data coming from hundreds of thousands of devices, in a scalable and reliable manner, into a stream for analysis. You then need a platform that can reliably and durably store the data while simultaneously reading from any point in the stream. This refers to the stream storage platform. The stored data is then processed by real-time applications to generate actionable insights, perform actions, and execute real-time extract-transform-load (ETL) operations that deliver the stream of data to an end destination, such as a data lake.

Next, let's see how systems can be designed in a resilient manner.

Messaging concepts

While relatively simple, the implementation of the four components can be nuanced. In all networked systems, failure is complicated. Every network call can have issues, and the systems need to be resilient to handle them. In the following sub-sections, we will review a few key concepts related to resilient systems and also a few advanced stream processing features.

Here are eight fallacies associated with distributed computing. In 1994, Peter Deutsch identified the fact that everyone who builds distributed systems initially gets into trouble by making the assumptions listed here:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • The topology doesn't change.
  • There is one administrator.
  • The cost of transport is zero.
  • The network is homogeneous (added by James Gosling in 1997).

    Note

    All systems should be designed with those fallacies in mind, and with special attention to the unreliability of the network. Systems that don't properly handle these issues will exhibit complicated and confusing behavior as well as error modes that are challenging to debug.

Timeouts

Timeouts allow for efficient allocation of resources and help prevent cascading failures. If an individual process has an error, it can fail to return a value and hang. In this case, the client may continue to wait indefinitely for a response. Timeouts help prevent server resource exhaustion by ending the connection after a maximum amount of time has passed. This allows the server to free up limited resources, for example, memory, connections, and ports, and use them to handle new requests. The client can retry the request again.

Retries

Many errors are ephemeral, and merely retrying the exact same request again will succeed. In order for retries to be safe, the system handling them must be idempotent, meaning that it is designed in such a way that the same input will cause the same side effects. At a more systemic level, to prevent a server from being inundated with retry requests, each client should implement back-off and jitter.

Back-off is the process of increasing the time between subsequent retries. Jitter is the process of adding a bit of random delay to retries. Together, these two mechanisms spread out message requests over time so that the server is able to handle the number of requests.

When a producer has to retry due to a timeout, it will send the request again. There is the possibility that a duplicate record could be created. If a record should only be processed once, it is important that the payload of the record has a unique ID that the final system can use to remove duplicates. When a consumer fails, it can fetch the same records again. Consumer retries tend to happen more often than producer retries. It is up to the final application to handle the message payload data properly and in an idempotent manner.

Backlogs

A backlog is the number of messages that the stream contains that have yet to be received by a consumer. Backlogs occur when the number of messages a producer sends into a stream is higher than the number of messages received by a consumer. This often happens when the system consuming the messages has an error and the messages keep being added to the stream. This can quickly go from a small backlog to a large backlog. Large backlogs increase the overall system latency by a large amount as the backlog must be processed before the recently arriving messages are processed. This typically results in a bimodal distribution of message latencies, where the latency is low when the system is working correctly and high when the system is having errors.

Large backlogs are a hidden risk that need to be considered when designing asynchronous applications because they can increase the recovery time following an outage – that is, instead of merely restarting the system and it being down for a brief period of time, the system has to work through the large backlog before it can function properly again.

Dead letter queues

Dead letter queues store messages that cannot be processed correctly by the message broker for some reason or another. It could be that it is an invalid message, it is too big, or, for some reason, it fails a certain number of retry attempts. It is important to periodically review dead letter queues because they represent errors in the system.

Replay

Replay is the ability to read, or replay, the same records in the same order multiple times. This means that a new consumer can be added and re-read messages that have already been consumed. Replay is limited to data in the stream. Data is aged out of the stream after it has existed for a specified period of time, for example, 1 hour, 1 day, or 7 days. This retention period affects the amount of storage required to support the stream.

Record processing

When processing records, there are multiple approaches depending on the type of data in the payload and the type of analysis required. In the simplest of systems, each record is processed one at a time, that is, record by record. A more complicated approach is to aggregate records by a sliding time window, where records are accessed by the consumer over a period of time, for instance, calculating the highest, lowest, and the average message value over the last 10 seconds.

Filtering

Filtering allows consumers to receive only the messages that they are interested in. This reduces the amount of data that is needed to be processed and transmitted, which helps the system scale. Messages can be filtered at multiple stages: source, ingestion, stream storage, stream processing, and in the consumer stage. In general, it is best to filter messages as early in the five-stage model as possible as it reduces compute and storage requirements in all subsequent stages. Filtering is determined by the message contents, the source, or the destination. For instance, the producer can send different types of messages to different streams.

Now that we've covered the core concepts, let's see them applied in some example use cases.