Scalable Data Streaming with Amazon Kinesis

By: Tarik Makota, Brian Maguire, Danny Gagne, Rajeev Chakrabarti

Overview of this book

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. These services provide APIs and client SDKs that enable you to produce and consume data at scale. Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use case presented throughout the book to help you get started, and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you'll learn how other AWS services, including Amazon Redshift, Amazon DynamoDB, Amazon S3, and Amazon Elasticsearch Service, as well as third-party applications such as Splunk, can be integrated with Kinesis. By the end of this AWS book, you'll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).
Table of Contents (13 chapters)

Section 1: Introduction to Data Streaming and Amazon Kinesis
Section 2: Deep Dive into Kinesis
Section 3: Integrations

Examples of data streaming

Data streams are essential for supporting a wide variety of workloads. This section details how data streams can be used for near real-time monitoring of applications through log aggregation, how they absorb bursty IoT workloads, how they serve real-time recommendations in web applications, and how they enable machine learning on video. The following diagram shows the data flow of these workloads:

Figure 1.4 – Examples of data stream applications

While these workloads differ in performance requirements and scale, the fundamental architecture is the same – producing and consuming messages. Now, let's look at an example of real-time monitoring.

Application log processing

Near real-time monitoring of applications and systems can be used to identify usage patterns, troubleshoot operational events, detect and monitor security incidents, and ensure compliance. Log events are generated on multiple systems and are pushed to a centralized system for analysis. Messaging systems enable this by decoupling the log processing and analysis systems. In general, for log analysis, two different systems consume the messages: one for near real-time analysis and one for larger historical batch analysis. The near real-time analysis system, often Elasticsearch, contains only fresh data as specified by a data retention policy, and might hold only an hour's, a day's, or a week's worth of information. The historical system is often an Apache Spark cluster processing data in a data lake (data stored in S3).

Log events are generated in real time and pushed to the messaging system. The two consumers read the data and perform ETL operations to convert it into the appropriate format for further analysis. For instance, an Apache Commons Logging format can be converted to JSON for insertion into Elasticsearch. The message broker simplifies the system by providing a clear boundary between the log collection and log analysis systems. Since it is designed to be highly available, it can buffer events if the log analysis system goes down.
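
As a rough illustration, the following Python sketch converts a log line in a hypothetical Apache Commons Logging-style layout into a JSON document ready for indexing in Elasticsearch. The pattern and field names are assumptions for this example; real log layouts vary by application:

import json
import re

# Hypothetical log layout, for example:
# "2021-03-15 12:34:56,789 ERROR [my.app.Service] Connection refused"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>\w+) "
    r"\[(?P<logger>[^\]]+)\] (?P<message>.*)"
)

def log_line_to_json(line):
    # Convert one log line into a JSON document for indexing.
    match = LOG_PATTERN.match(line)
    if match is None:
        # Keep unparseable lines instead of dropping them.
        return json.dumps({"raw": line})
    return json.dumps(match.groupdict())

print(log_line_to_json(
    "2021-03-15 12:34:56,789 ERROR [my.app.Service] Connection refused"))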

There are many sources of log events; two common ones are CloudWatch Logs and agents that can be installed on a machine, for example, Kinesis Agent. CloudWatch is an AWS service that collects logs, metrics, and events from AWS resources and user applications. The logs are sent to streams based on subscriptions and subscription filters that define patterns to determine which log events should be sent. The events are compressed with gzip and Base64-encoded. Agents monitor sets of files and stream events that are normally delimited by a newline (\n) character.
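
A consumer has to reverse that encoding before it can work with the log events. The following sketch shows how a Python consumer, for example, a Lambda function subscribed to the stream, might unpack one CloudWatch Logs record; the surrounding event shape is assumed to be the standard Lambda Kinesis event:

import base64
import gzip
import json

def decode_cloudwatch_record(b64_data):
    # Base64-decode, gunzip, then parse the JSON envelope that
    # CloudWatch Logs wraps around the log events.
    compressed = base64.b64decode(b64_data)
    payload = json.loads(gzip.decompress(compressed))
    # payload includes logGroup, logStream, and logEvents fields.
    return payload

# In a Lambda consumer, the records arrive under event["Records"]:
# for record in event["Records"]:
#     payload = decode_cloudwatch_record(record["kinesis"]["data"])
#     for log_event in payload["logEvents"]:
#         print(log_event["message"])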

Bringing all the logs together in near real time makes it possible to take proactive measures. For example, imagine an attacker using an automated tool such as sqlmap to perform a SQL injection attack via an HTTP query string. A query string is a set of key-value pairs separated from the base URL by a question mark (?) character, with each key-value pair separated by an ampersand (&) character. For example, in the following URL, there are two keys, key1 and key2, with their corresponding values, value1 and value2:

https://example.com/mypage?key1=value1&key2=value2
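
In Python, the standard library parses this structure directly, which is handy when writing the ETL step that extracts query strings from log events:

from urllib.parse import urlparse, parse_qs

url = "https://example.com/mypage?key1=value1&key2=value2"
query = parse_qs(urlparse(url).query)
print(query)  # {'key1': ['value1'], 'key2': ['value2']}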

The first sign of such an attack is a large number of distinct query strings originating from a single IP address. Once the IP address is identified, it can be blocked to prevent further attacks. The analysis system can then be used to determine all requests made by that client and detect whether they were able to exploit any vulnerabilities.
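
A minimal detection sketch along these lines is shown below; the threshold, the window, and the event shape are all assumptions for illustration, and a production system would tune them against real traffic:

from collections import defaultdict

THRESHOLD = 100  # distinct query strings per IP per window (assumed)

def suspicious_ips(events):
    # events: iterable of (source_ip, query_string) pairs taken
    # from one time window of the aggregated log stream.
    seen = defaultdict(set)
    for ip, query_string in events:
        seen[ip].add(query_string)
    return [ip for ip, queries in seen.items()
            if len(queries) > THRESHOLD]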

Internet of Things

IoT devices present unique challenges as they are often only connected to the internet intermittently to save bandwidth and conserve energy. This intermittent connectivity, combined with a large number of devices, can lead to extremely bursty workloads. For instance, a fleet of IoT devices with temperature sensors might send data back every hour. The messaging system provides a buffer that allows downstream systems to be provisioned for the average velocity of data and not the peak loads.
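
A device-side producer can be as simple as the following boto3 sketch; the stream and device names are hypothetical, and a real fleet would batch records with PutRecords and add retry logic:

import json
import boto3

kinesis = boto3.client("kinesis")

def send_reading(stream_name, device_id, temperature):
    # Using the device ID as the partition key keeps readings
    # from one device ordered within a shard.
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps({"device_id": device_id,
                         "temperature": temperature}).encode("utf-8"),
        PartitionKey=device_id,
    )

send_reading("iot-temperature-stream", "sensor-42", 21.7)  # hypothetical names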

Real-time recommendations

Clickstream events are generated at extremely high volume and velocity as users navigate and use web applications and mobile applications. Clickstream analysis can be used for A/B testing, understanding user engagement, detecting system issues, and in this example, recommendations.

Simple recommendations can be pre-computed based on historic usage patterns, for instance, "people who watched this movie also liked these movies." However, this fails to capture the user's intent – that is, personalized recommendations depending on the user's behavior in the given session. This requires clickstream data to be captured in real time, analyzed, and recommendations made, all in the time it takes for a page to load. In other words, the system needs to work in milliseconds. These performance constraints require highly scalable messaging systems that achieve extremely low latency so that page load performance is not degraded.
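
On the consuming side, a minimal polling loop against a Kinesis data stream might look like the following sketch; the stream and shard names are hypothetical, and a production consumer would use the Kinesis Client Library or enhanced fan-out rather than raw polling:

import json
import boto3

kinesis = boto3.client("kinesis")

def read_latest(stream_name, shard_id):
    # Start at the tip of the shard so only new clickstream
    # events are read.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            click = json.loads(record["Data"])
            print(click)  # hand the event to the recommendation model here
        iterator = response.get("NextShardIterator")

read_latest("clickstream", "shardId-000000000000")  # hypothetical names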

Video streams

Video streams can be used for both real-time workloads (chat, peer-to-peer) and batch workloads (surveillance, machine learning). In the batch case, multiple cameras can stream video to the messaging system, and machine learning can be applied to detect faces. These faces can then be identified and checked against a set of known individuals. Any face that doesn't match a known individual can trigger an alert and send the relevant portion of the video to the appropriate person. Messaging frameworks simplify the architecture by providing a highly scalable system that handles large volumes of data from multiple devices. Much like in the IoT case, they also provide a buffer that gives downstream resources time to be provisioned in response to demand as new devices connect.
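
With Kinesis Video Streams, a consumer first asks for a data endpoint and then reads the media payload from it. The following sketch outlines that flow; the stream name is hypothetical, and parsing the returned MKV fragments into frames for a face detection model is left out:

import boto3

STREAM_NAME = "camera-lobby"  # hypothetical stream name

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="GET_MEDIA"
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media.get_media(
    StreamName=STREAM_NAME,
    StartSelector={"StartSelectorType": "NOW"},
)
# response["Payload"] is a streaming body of MKV fragments; a real
# consumer would parse them and feed decoded frames to a model.
chunk = response["Payload"].read(1024)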