Scalable Data Streaming with Amazon Kinesis

By: Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Overview of this book

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. These services provide APIs and client SDKs that enable you to produce and consume data at scale. Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use case presented throughout the book to help you get started, and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you'll learn how other AWS services, including Amazon Redshift, Amazon DynamoDB, Amazon S3, and Amazon Elasticsearch Service, as well as third-party applications such as Splunk, can be integrated with Kinesis. By the end of this AWS book, you'll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).
Table of Contents (13 chapters)

Section 1: Introduction to Data Streaming and Amazon Kinesis
Section 2: Deep Dive into Kinesis
Section 3: Integrations

Examples of data streaming

Data streams are essential for supporting a wide variety of workloads. This section details how data streams can be used for near real-time monitoring of applications through log aggregation, how they absorb bursty IoT workloads, how they serve real-time recommendations in web applications, and how they enable machine learning on video. The following diagram shows the data flow of these workloads:

Figure 1.4 – Examples of data stream applications

While these workloads differ in performance requirements and scale, the fundamental architecture is the same – producing and consuming messages. Now, let's look at an example of real-time monitoring.

Application log processing

Near real-time monitoring of applications and systems can be used to identify usage patterns, troubleshoot operational events, detect and monitor security incidents, and ensure compliance. Log events are generated on multiple systems and are pushed to a centralized system for analysis. Messaging systems enable this by decoupling the log processing and analysis systems. In general, for log analysis, two different systems consume the messages: one for near real-time analysis and one for larger historical batch analysis. The near real-time analysis system, often Elasticsearch, contains only fresh data as specified by a data retention policy, and might hold only an hour's, a day's, or a week's worth of information. The historical system is often an Apache Spark cluster processing data in a data lake (data stored in S3).

Log events are generated in real time and pushed to the messaging system. The two consumers read the data and perform ETL operations to convert it into the appropriate format for further analysis. For instance, an Apache Commons Logging format can be converted to JSON for insertion into Elasticsearch. The message broker simplifies the system by providing a clear boundary between the log collection and log analysis systems. Since it is designed to be highly available, it can buffer events if the log analysis system goes down.
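
As a rough illustration, the following Python sketch converts a log line in a hypothetical Apache Commons Logging-style layout into a JSON document ready for indexing in Elasticsearch. The pattern and field names are assumptions for this example; real log layouts vary by application:

import json
import re

# Hypothetical log layout, for example:
# "2021-03-15 12:34:56,789 ERROR [my.app.Service] Connection refused"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>\w+) "
    r"\[(?P<logger>[^\]]+)\] (?P<message>.*)"
)

def log_line_to_json(line):
    # Convert one log line into a JSON document for indexing.
    match = LOG_PATTERN.match(line)
    if match is None:
        # Keep unparseable lines instead of dropping them.
        return json.dumps({"raw": line})
    return json.dumps(match.groupdict())

print(log_line_to_json(
    "2021-03-15 12:34:56,789 ERROR [my.app.Service] Connection refused"))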

There are many sources of log events; two common ones are CloudWatch Logs and agents that can be installed on a machine, for example, Kinesis Agent. CloudWatch is an AWS service that collects logs, metrics, and events from AWS resources and user applications. The logs are sent to streams based on subscriptions and subscription filters that define patterns to determine which log events should be sent. The events are compressed with gzip and Base64-encoded. Agents monitor sets of files and stream events that are normally delimited by a newline (\n) character.
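
A consumer has to reverse that encoding before it can work with the log events. The following sketch shows how a Python consumer, for example, a Lambda function subscribed to the stream, might unpack one CloudWatch Logs record; the surrounding event shape is assumed to be the standard Lambda Kinesis event:

import base64
import gzip
import json

def decode_cloudwatch_record(b64_data):
    # Base64-decode, gunzip, then parse the JSON envelope that
    # CloudWatch Logs wraps around the log events.
    compressed = base64.b64decode(b64_data)
    payload = json.loads(gzip.decompress(compressed))
    # payload includes logGroup, logStream, and logEvents fields.
    return payload

# In a Lambda consumer, the records arrive under event["Records"]:
# for record in event["Records"]:
#     payload = decode_cloudwatch_record(record["kinesis"]["data"])
#     for log_event in payload["logEvents"]:
#         print(log_event["message"])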

Bringing all the logs together in near real time makes it possible to take proactive measures. For example, imagine an attacker using an automated tool such as sqlmap to perform a SQL injection attack via an HTTP query string. A query string is a set of key-value pairs separated from the base URL by a question mark (?) character, with each key-value pair separated by an ampersand (&) character. For example, in the following URL, there are two keys, key1 and key2, with their corresponding values, value1 and value2:

https://example.com/mypage?key1=value1&key2=value2
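
In Python, the standard library parses this structure directly, which is handy when writing the ETL step that extracts query strings from log events:

from urllib.parse import urlparse, parse_qs

url = "https://example.com/mypage?key1=value1&key2=value2"
query = parse_qs(urlparse(url).query)
print(query)  # {'key1': ['value1'], 'key2': ['value2']}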

The first sign of such an attack is a large number of distinct query strings originating from a single IP address. Once the IP address is identified, it can be blocked to prevent further attacks. The analysis system can then be used to determine all requests made by that client and detect whether they were able to exploit any vulnerabilities.
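
A minimal detection sketch along these lines is shown below; the threshold, the window, and the event shape are all assumptions for illustration, and a production system would tune them against real traffic:

from collections import defaultdict

THRESHOLD = 100  # distinct query strings per IP per window (assumed)

def suspicious_ips(events):
    # events: iterable of (source_ip, query_string) pairs taken
    # from one time window of the aggregated log stream.
    seen = defaultdict(set)
    for ip, query_string in events:
        seen[ip].add(query_string)
    return [ip for ip, queries in seen.items()
            if len(queries) > THRESHOLD]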

Internet of Things

IoT devices present unique challenges as they are often only connected to the internet intermittently to save bandwidth and conserve energy. This intermittent connectivity, combined with a large number of devices, can lead to extremely bursty workloads. For instance, a fleet of IoT devices with temperature sensors might send data back every hour. The messaging system provides a buffer that allows downstream systems to be provisioned for the average velocity of data and not the peak loads.
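
A device-side producer can be as simple as the following boto3 sketch; the stream and device names are hypothetical, and a real fleet would batch records with PutRecords and add retry logic:

import json
import boto3

kinesis = boto3.client("kinesis")

def send_reading(stream_name, device_id, temperature):
    # Using the device ID as the partition key keeps readings
    # from one device ordered within a shard.
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps({"device_id": device_id,
                         "temperature": temperature}).encode("utf-8"),
        PartitionKey=device_id,
    )

send_reading("iot-temperature-stream", "sensor-42", 21.7)  # hypothetical names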

Real-time recommendations

Clickstream events are generated at extremely high volume and velocity as users navigate and use web applications and mobile applications. Clickstream analysis can be used for A/B testing, understanding user engagement, detecting system issues, and in this example, recommendations.

Simple recommendations can be pre-computed based on historic usage patterns, for instance, "people who watched this movie also liked these movies." However, this fails to capture the user's intent – that is, personalized recommendations depending on the user's behavior in the given session. This requires clickstream data to be captured in real time, analyzed, and recommendations made, all in the time it takes for a page to load. In other words, the system needs to work in milliseconds. These performance constraints require highly scalable messaging systems that achieve extremely low latency so that page load performance is not degraded.
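
On the consuming side, a minimal polling loop against a Kinesis data stream might look like the following sketch; the stream and shard names are hypothetical, and a production consumer would use the Kinesis Client Library or enhanced fan-out rather than raw polling:

import json
import boto3

kinesis = boto3.client("kinesis")

def read_latest(stream_name, shard_id):
    # Start at the tip of the shard so only new clickstream
    # events are read.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            click = json.loads(record["Data"])
            print(click)  # hand the event to the recommendation model here
        iterator = response.get("NextShardIterator")

read_latest("clickstream", "shardId-000000000000")  # hypothetical names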

Video streams

Video streams can be used for both real-time workloads (chat, peer-to-peer) and batch workloads (surveillance, machine learning). In the batch case, multiple cameras can stream video to the messaging system, and machine learning can be applied to detect faces. These faces can then be identified and checked against a set of known individuals. Any face that doesn't match a known individual can trigger an alert and send the relevant portion of the video to the appropriate person. Messaging frameworks simplify the architecture by providing a highly scalable system that handles large volumes of data from multiple devices. Much like in the IoT case, they also provide a buffer that gives downstream resources time to be provisioned in response to demand as new devices connect.
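
With Kinesis Video Streams, a consumer first asks for a data endpoint and then reads the media payload from it. The following sketch outlines that flow; the stream name is hypothetical, and parsing the returned MKV fragments into frames for a face detection model is left out:

import boto3

STREAM_NAME = "camera-lobby"  # hypothetical stream name

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="GET_MEDIA"
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media.get_media(
    StreamName=STREAM_NAME,
    StartSelector={"StartSelectorType": "NOW"},
)
# response["Payload"] is a streaming body of MKV fragments; a real
# consumer would parse them and feed decoded frames to a model.
chunk = response["Payload"].read(1024)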