Scalable Data Streaming with Amazon Kinesis

By : Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Scalable Data Streaming with Amazon Kinesis

By: Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Overview of this book

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. This data streaming service provides APIs and client SDKs that enable you to produce and consume data at scale. Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use case shown through the book to help you get started and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you’ll learn how other AWS services can be integrated into Kinesis. These services include Redshift, Dynamo Database, AWS S3, Elastic Search, and third-party applications such as Splunk. By the end of this AWS book, you’ll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Introduction to Data Streaming and Amazon Kinesis

Free Chapter

Chapter 1: What Are Data Streams?

Introducing data streams

The value of real-time data in analytics

Decoupling systems

Challenges associated with distributed systems

Overview of messaging concepts

Examples of data streaming

Summary

Further reading

Chapter 2: Messaging and Data Streaming in AWS

Amazon Kinesis Data Streams (KDS)

Amazon Kinesis Data Firehose (KDF)

Amazon Kinesis Data Analytics (KDA)

Amazon Kinesis Video Streams (KVS)

Amazon Simple Queue Service (SQS)

Amazon Simple Notification Service (SNS)

Amazon MQ for Apache ActiveMQ

IoT Core

Amazon Managed Streaming for Apache Kafka (MSK)

Amazon EventBridge

Service comparison summary

Summary

Chapter 3: The SmartCity Bike-Sharing Service

The mission for sustainable transportation

SmartCity new mobile features

The AWS Well-Architected Framework

Summary

Further reading

Section 2: Deep Dive into Kinesis

Chapter 4: Kinesis Data Streams

Technical requirements

Discovering Amazon Kinesis Data Streams

Creating a stream producer application

Creating a stream consumer application

Data pipelines with Amazon Kinesis Data Streams

Summary

Further reading

Chapter 5: Kinesis Firehose

Technical requirements

Discovering Amazon Kinesis Firehose

Understanding encryption in KDF

Using data transformation in KDF with a Lambda function

Understanding delivery stream destinations

Understanding data format conversion in KDF

Understanding monitoring in KDF

Use-case example – Bikeshare station data pipeline with KDF

Summary

Further reading

Chapter 6: Kinesis Data Analytics

Technical requirements

Discovering Amazon KDA

Working on SmartCity bike share analytics use cases

Creating operational insights using SQL Engine

Creating operational insights using Apache Flink

Building bike ride analytic applications

Monitoring KDA applications

Summary

Further reading

Chapter 7: Amazon Kinesis Video Streams

Technical requirements

Understanding video fundamentals

Discovering Amazon Kinesis video streams WebRTC

Discovering Amazon KVS

Building video-enabled applications with KVS

Summary

Further reading

Section 3: Integrations

Chapter 8: Kinesis Integrations

Technical requirements

Amazon services that can produce data to send to Kinesis

Amazon services that transform Kinesis data

Third-party integrations with Kinesis

Summary

Further reading

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Leave a review - let other readers know what you think

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Overview of messaging concepts

In this section, we will review the concept of message brokers in a high-level, implementation-agnostic manner. First, we will go over the core components of all messaging systems and then we will review some key terminology and concepts related to their use.

Overview of core messaging components

There are four components in all messaging systems: producers, consumers, streams, and messages. The following diagram shows a logical breakdown of producers sending messages to a stream, the stream buffering them, and consumers receiving them:

Figure 1.3 – High-level view of messaging

Despite this design's relative simplicity, there is a substantial amount of configuration and optimization that is possible. Now, let's dive a little deeper into each component.

Streams

The stream is the system that stores the messages or records sent by the producers and retrieved by the consumers. They can be ordered in a First In First Out (FIFO) model. Messages in the stream that have been received, but not yet retrieved, are referred to as a backlog.

The retention period is the length of time that the records are accessible after they are added to the stream. This is the maximum size the backlog can be, and it is also the maximum time a new, slow, or intermittent consumer can access the records.

Messages (records)

A message consists of a payload and header information. The header information consists of information set by the producer, and it includes a unique identifier assigned by the message broker when it is inserted into the stream. In general, messages are relatively small, in the order of kilobytes, and messaging systems generally have a maximum payload size.

Producers

The producer is an application that is the source of data that will go into the message or record. It connects to the message broker and puts the data into the stream. There can often be multiple producers sending data to the same message broker.

Consumers

The consumer is an application that receives the messages that are sent by the producer. It connects to the message broker and retrieves the data from the stream. The responsibility for keeping track of the last read message, so that the consumer can retrieve the next message, can be handled either by the message broker (RabbitMQ or SQS) or by the consumer (Kinesis or Kafka). There can be multiple consumers for a message broker.

Real-time analytics

When thinking about real-time analytics, it can be useful to expand it from the producer, stream, consumer model to a five-stage model (Figure 1.3): 1) source of data; 2) data ingestion mechanisms; 3) stream storage; 4) real-time stream processing; and 5) destination, data sink, or action. This model helps us elevate our thinking from the structural communication level to the data processing level. For instance, filtering can be applied at every stage to reduce compute downstream.

The source of data refers to where the data is coming from. For example, it could be mobile devices, web clickstreams, log analytics, IoT devices, or smart devices. Once you have the source, the data needs to be ingested into the stream. This requires a solution that can capture data coming from hundreds of thousands of devices, in a scalable and reliable manner, into a stream for analysis. You then need a platform that can reliably and durably store the data while simultaneously reading from any point in the stream. This refers to the stream storage platform. The stored data is then processed by real-time applications to generate actionable insights, perform actions, and execute real-time extract-transform-load (ETL) operations that deliver the stream of data to an end destination, such as a data lake.

Next, let's see how systems can be designed in a resilient manner.

Messaging concepts

While relatively simple, the implementation of the four components can be nuanced. In all networked systems, failure is complicated. Every network call can have issues, and the systems need to be resilient to handle them. In the following sub-sections, we will review a few key concepts related to resilient systems and also a few advanced stream processing features.

Here are eight fallacies associated with distributed computing. In 1994, Peter Deutsch identified the fact that everyone who builds distributed systems initially gets into trouble by making the assumptions listed here:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
The topology doesn't change.
There is one administrator.
The cost of transport is zero.
The network is homogeneous (added by James Gosling in 1997).
Note
All systems should be designed with those fallacies in mind, and with special attention to the unreliability of the network. Systems that don't properly handle these issues will exhibit complicated and confusing behavior as well as error modes that are challenging to debug.

Timeouts

Timeouts allow for efficient allocation of resources and help prevent cascading failures. If an individual process has an error, it can fail to return a value and hang. In this case, the client may continue to wait indefinitely for a response. Timeouts help prevent server resource exhaustion by ending the connection after a maximum amount of time has passed. This allows the server to free up limited resources, for example, memory, connections, and ports, and use them to handle new requests. The client can retry the request again.

Retries

Many errors are ephemeral, and merely retrying the exact same request again will succeed. In order for retries to be safe, the system handling them must be idempotent, meaning that it is designed in such a way that the same input will cause the same side effects. At a more systemic level, to prevent a server from being inundated with retry requests, each client should implement back-off and jitter.

Back-off is the process of increasing the time between subsequent retries. Jitter is the process of adding a bit of random delay to retries. Together, these two mechanisms spread out message requests over time so that the server is able to handle the number of requests.

When a producer has to retry due to a timeout, it will send the request again. There is the possibility that a duplicate record could be created. If a record should only be processed once, it is important that the payload of the record has a unique ID that the final system can use to remove duplicates. When a consumer fails, it can fetch the same records again. Consumer retries tend to happen more often than producer retries. It is up to the final application to handle the message payload data properly and in an idempotent manner.

Backlogs

A backlog is the number of messages that the stream contains that have yet to be received by a consumer. Backlogs occur when the number of messages a producer sends into a stream is higher than the number of messages received by a consumer. This often happens when the system consuming the messages has an error and the messages keep being added to the stream. This can quickly go from a small backlog to a large backlog. Large backlogs increase the overall system latency by a large amount as the backlog must be processed before the recently arriving messages are processed. This typically results in a bimodal distribution of message latencies, where the latency is low when the system is working correctly and high when the system is having errors.

Large backlogs are a hidden risk that need to be considered when designing asynchronous applications because they can increase the recovery time following an outage – that is, instead of merely restarting the system and it being down for a brief period of time, the system has to work through the large backlog before it can function properly again.

Dead letter queues

Dead letter queues store messages that cannot be processed correctly by the message broker for some reason or another. It could be that it is an invalid message, it is too big, or, for some reason, it fails a certain number of retry attempts. It is important to periodically review dead letter queues because they represent errors in the system.

Replay

Replay is the ability to read, or replay, the same records in the same order multiple times. This means that a new consumer can be added and re-read messages that have already been consumed. Replay is limited to data in the stream. Data is aged out of the stream after it has existed for a specified period of time, for example, 1 hour, 1 day, or 7 days. This retention period affects the amount of storage required to support the stream.

Record processing

When processing records, there are multiple approaches depending on the type of data in the payload and the type of analysis required. In the simplest of systems, each record is processed one at a time, that is, record by record. A more complicated approach is to aggregate records by a sliding time window, where records are accessed by the consumer over a period of time, for instance, calculating the highest, lowest, and the average message value over the last 10 seconds.

Filtering

Filtering allows consumers to receive only the messages that they are interested in. This reduces the amount of data that is needed to be processed and transmitted, which helps the system scale. Messages can be filtered at multiple stages: source, ingestion, stream storage, stream processing, and in the consumer stage. In general, it is best to filter messages as early in the five-stage model as possible as it reduces compute and storage requirements in all subsequent stages. Filtering is determined by the message contents, the source, or the destination. For instance, the producer can send different types of messages to different streams.

Now that we've covered the core concepts, let's see them applied in some example use cases.

Scalable Data Streaming with Amazon Kinesis

By : Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Scalable Data Streaming with Amazon Kinesis

By: Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Overview of this book

Related Content you might be interested in

Current Title:

Scalable Data Streaming with Amazon Kinesis

Serverless Architectures with AWS

Data Engineering with AWS

Geospatial Data Analytics on AWS

Overview of messaging concepts

Overview of core messaging components

Streams

Messages (records)

Producers

Consumers

Real-time analytics

Messaging concepts

Timeouts

Retries

Backlogs

Dead letter queues

Replay

Record processing

Filtering