Introducing data streams

Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics to extract the value from data by acting on it before it is stale. They also enable organizations to design and develop resilient software based on microservice architectures by helping them to decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how they can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems, and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.

Sources of data

The proliferation of data steadily increases from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems, since it is typically high-velocity, intermittent, and bursty, making it difficult to adequately provision and design downstream systems. Payloads are generally small, except when containing audio or video data, and come in a variety of formats.

In this book, we will be focusing on three data formats. These formats include the following:

JavaScript Object Notation (JSON)
Log files
Time-encoded binary files such as video

JSON streams

JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on the JavaScript object syntax. It has two data structures – hash tables and lists. A hash table consists of key-value pairs, {"key":"value"}, where the keys must be unique. A list is a set of values in a specific order, ["value 1", "value 2"]. The following code sample shows a sample IoT JSON message:

{
    "deviceid" : "device001",
    "eventTime": -192778200,
    "temp" : 68.4,
    "humidity" : 77.3,
    "coords" : {
        "latitude" : 32.779039,
        "longitude" : -96.808660
    }
}

Log file streams

Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC3164 Syslog. They are plain text, and usually each line, delineated by a newline ('\n') character, is a separate log entry. In the following sample log, we see an HTTP GET request where the IP address is 10.13.37.01, the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the result.

The sample log line in Apache Commons Logging format is as follows:

10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "GET /mailman/listinfo/test HTTP/1.1" 200 2457

Time-encoded binary streams

Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.

Figure 1.1 – Time-encoded video data

As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.

Tech Concepts

Programming languages

Tech Tools

Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

50+ new titles added per month and exclusive early access to books as they are being written.

Scalable Data Streaming with Amazon Kinesis

By : Makota, Brian Maguire, Gagne, Chakrabarti

Scalable Data Streaming with Amazon Kinesis

By: Makota, Brian Maguire, Gagne, Chakrabarti

Overview of this book

Introducing data streams

Sources of data

JSON streams

Log file streams

Time-encoded binary streams

Scalable Data Streaming with Amazon Kinesis

By : Makota, Brian Maguire, Gagne, Chakrabarti

Scalable Data Streaming with Amazon Kinesis

By: Makota, Brian Maguire, Gagne, Chakrabarti

Overview of this book

Introducing data streams

Sources of data

JSON streams

Log file streams

Time-encoded binary streams

Confirmation

Buy this book with your credits?

Submit Your Feedback

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access