Machine Learning for Streaming Data with Python

By: Joos Korstanje

Overview of this book

Streaming data is the new top technology to watch out for in the field of data science and machine learning. As business needs become more demanding, many use cases require real-time analysis as well as real-time machine learning. This book will help you to get up to speed with data analytics for streaming data and focus strongly on adapting machine learning and other analytics to the case of streaming data. You will first learn about the architecture for streaming and real-time machine learning. Next, you will look at the state-of-the-art frameworks for streaming data like River. Later chapters will focus on various industrial use cases for streaming data like Online Anomaly Detection and others. As you progress, you will discover various challenges and learn how to mitigate them. In addition to this, you will learn best practices that will help you use streaming data to generate real-time insights. By the end of this book, you will have gained the confidence you need to stream data in your machine learning models.
Table of Contents (17 chapters)
Part 1: Introduction and Core Concepts of Streaming Data
Part 2: Exploring Use Cases for Data Streaming
Part 3: Advanced Concepts and Best Practices around Streaming Data
Chapter 12: Conclusion and Best Practices

Working with streaming data

Streaming data is data that arrives continuously, piece by piece, rather than all at once. You may know the term streaming from online video services. When you stream a video, the service keeps sending the next parts of the video while you are already watching the first part.

The concept is the same when working with streaming data. The data format is not necessarily video and can be any data type that is useful for your use case. One of the most intuitive examples is that of an industrial production line, in which you have continuous measurements from sensors. As long as your production line doesn't pause, you will continue to generate measurements. The following figure gives an overview of the data streaming process:

Figure 1.3 – The data streaming process


The important notion is that you have a continuous flow of data that you need to treat in real time. You cannot wait until the production line stops to do your analysis, as you would need to detect potential problems right away.

Streaming data versus batch data

Streaming data is generally not among the first use cases that new data scientists work on. The type of problem that is usually introduced first is the batch use case. Batch data is the opposite of streaming data, as it works in phases: you first collect a set of data, and then you process that whole set at once.

If you see streaming data as streaming a video online, you could see batch data as downloading the entire video first and then watching it when the download is finished. For analytical purposes, this would mean that you get the analysis of a set of data only when the data generating process has finished, rather than whenever a problem occurs.

For some use cases, this is not a problem. Yet, you can understand that streaming can deliver great added value in those use cases where fast analytics can have an impact. It also has added value in use cases where data is ingested in a streaming method, which is becoming more and more common. In practice, many use cases that would get added value through streaming are still solved with batch treatment, just because these methods are better known and more widespread.
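To make the contrast between batch and streaming concrete, here is a minimal sketch that computes the same statistic (a mean) in both styles. The function names and the sensor readings are hypothetical and only serve as an illustration.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: needs the complete data set before it can answer."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: yields an updated mean after every new record."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update: earlier records do not need to stay in memory.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor measurements, used for both approaches.
readings = [20.0, 22.0, 21.0, 25.0]

final_batch = batch_mean(readings)             # available only at the end
running = list(streaming_mean(readings))       # one result per record
print(final_batch, running)
```

Both approaches end on the same value, but the streaming version already has a usable answer after the first record.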

The following overview shows the batch treatment process:

Figure 1.4 – The batch process


Advantages of streaming data

Let's now look at some advantages of using streaming analytics rather than other approaches in the following subsections.

Data generating processes are in real time

The first advantage of building streaming data analytics rather than batch systems is that many data generating processes are actually in real time. You will discover a number of use cases later, but in general, it is rare that data collection is done in batches.

Although most of us are used to building batch systems around real-time data generating systems, it often makes more sense to build streaming analytics directly.

Of course, batch analytics and streaming analytics can co-exist. Yet, adding a batch treatment to a streaming analytics service is often much easier than adding streaming functionality into a system that is designed for batches. It simply makes the most sense to start with streaming.

Real-time insights have value

When designing data science solutions, streaming does not always come to mind first. However, when solutions or tools are built in real time, it is rare that the real-time functionality is not appreciated.

Many of today's analytical solutions are built in real time, and the tools to do so are available. In many problems, real-time information will be used at some point. Maybe it will not be used from the start, but the day that anomalies happen, you will find a great competitive advantage in having the analytics straight away, rather than waiting until the next hour or the next morning.

Examples of successful implementation of streaming analytics

Let's talk about some examples of companies that have implemented real-time analytics successfully. The first example is Shell. They have been able to implement real-time analytics of the security cameras at their gas stations. An automated, real-time machine learning pipeline is able to detect whether people are smoking.

Another example is the use of sensor data in connected sports equipment. By measuring heart rate and other KPIs in real time, such devices are able to alert you when anything is wrong with your body.

Of course, the big players such as Facebook and Twitter also analyze a lot of data in real time, for example, when detecting fake news or bad content. There are many successful use cases of streaming analytics, yet at the same time, there are some common challenges that streaming data brings with it. Let's have a look at them now.

Challenges of streaming data

Streaming data analytics is currently less widespread than batch data analytics. Although this is slowly changing, it is good to understand where the challenges are when working with streaming data.

Knowledge of streaming analytics

One simple reason for streaming analytics being less widespread is a lack of knowledge and know-how. Setting up streaming analytics is often not taught in schools and is definitely not taught as the go-to method. There are also fewer resources available on the internet to get started with it. As there are many more resources on machine learning and analytics for batch treatment, and the batch methods do not apply directly to streaming data, people tend to start with batch applications for data science.

Understanding the architecture

A second difficulty when working on streaming data is architecture. Although some data science practitioners have knowledge of architecture, data engineering, and DevOps, this is not always the case. To set up a streaming analytics proof of concept or a minimum viable product (MVP), all those skills are needed. For batch treatment, it is often enough to work with scripts.

Architectural difficulties are inherent to streaming, as it is necessary to work with real-time processes that send individually collected records to an analytical treatment process that will update in real time. If there is no architecture that can handle this, it does not make much sense to start with streaming analytics.
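The pattern described above, one process that sends individual records and another that updates its analysis as each record arrives, can be sketched on a single machine with Python's standard `queue` and `threading` modules. This is only an illustration under simplifying assumptions: a real streaming architecture would typically use a dedicated messaging system between separate services, and the function names below are hypothetical.

```python
import queue
import threading
from typing import List

STOP = object()  # sentinel that tells the consumer the stream has ended


def produce(readings: List[float], out_queue: queue.Queue) -> None:
    """Send records one by one, as a sensor or application would."""
    for reading in readings:
        out_queue.put(reading)
    out_queue.put(STOP)


def consume(in_queue: queue.Queue, running_totals: List[float]) -> None:
    """Update the analysis (here: a running total) for every record."""
    total = 0.0
    while True:
        item = in_queue.get()  # blocks until the next record arrives
        if item is STOP:
            break
        total += item
        running_totals.append(total)


records: queue.Queue = queue.Queue()
running_totals: List[float] = []

worker = threading.Thread(target=consume, args=(records, running_totals))
worker.start()
produce([1.0, 2.0, 3.5], records)  # hypothetical measurements
worker.join()
print(running_totals)
```

The queue decouples the two sides: the producer never waits for the analysis, and the consumer processes records in arrival order.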

Financial hurdles

Another challenge when working with streaming data is the financial aspect. Although working with streaming is not necessarily more expensive in the long run, it can be more expensive to set up the infrastructure needed to get started. Working on a local developer PC for an MVP is unlikely to succeed as the data needs to be treated in real time.

Risks of runtime problems

Real-time processes also have a larger risk of runtime problems. When building software, bugs and failures happen. If you are on a daily batch process, you may be able to repair the process, rerun the failed batch, and solve the problem.

If a streaming tool is down, there is a risk of losing data. As the data should be ingested in real time, the data that is generated during an outage of your process may not be recoverable. If your process is very important, you will need to set up extensive monitoring day and night and have more quality checks before pushing your solutions to production. Of course, this is also important in batch processes, but even more so in streaming.

Smaller analytics (fewer methods easily available)

The last challenge of streaming analytics is that the common methods are generally developed for batch data first. There are currently many solutions out there for analytics on real time and streaming data, but still not as many as for batch data.

Also, since streaming analysis has to be done very quickly to respect real-time delivery, streaming use cases tend to end up with less sophisticated analytical methodologies and stay at the level of descriptive or basic analyses.

How to get started with streaming data

For companies getting started with streaming data, the first step is often to put in place simple applications that collect real-time data and make it accessible in real time. Common data sources to start with are log data, website visit data, and sensor data.

A next step is often to build reporting tools on top of the real-time data source. Think of KPI dashboards that update in real time, or small and simple alerting tools that apply high or low threshold values derived from business rules.
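As an illustration of such a simple alerting tool, the following sketch checks every incoming record against a low and a high threshold. The thresholds, the `temperature` field, and the function name are assumptions made for this example, not values from a real system.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical business rule: acceptable temperature range in degrees Celsius.
LOW_THRESHOLD = 10.0
HIGH_THRESHOLD = 30.0


def threshold_alerts(
    records: Iterable[Dict[str, float]],
    low: float = LOW_THRESHOLD,
    high: float = HIGH_THRESHOLD,
) -> Iterator[Tuple[int, str]]:
    """Yield (record index, message) as soon as a record breaks a rule."""
    for index, record in enumerate(records):
        value = record["temperature"]
        if value < low:
            yield index, f"temperature {value} below {low}"
        elif value > high:
            yield index, f"temperature {value} above {high}"


# Hypothetical stream of sensor records.
stream = [{"temperature": 21.5}, {"temperature": 35.0}, {"temperature": 8.0}]
alerts = list(threshold_alerts(stream))
for index, message in alerts:
    print(index, message)
```

Because the function is a generator, an alert is produced the moment the offending record is read, without waiting for the rest of the stream.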

When such systems are in place, this paves the way to replacing those business rules or building on top of them. Think of more advanced analytics tools, including real-time machine learning for anomaly detection and more.

The most complex step is to add automated feedback loops between your real-time machine learning and your process. After all, there is no reason to stop at analytics for business insights if there is potential to automate and improve decision-making as well.

Common use cases for streaming data

Let's see a few of the most common use cases for streaming data so that you can get a better feel of the use cases that can benefit from streaming techniques. This will cover three use cases that are relatively accessible for anyone, but of course, there are many more.

Sensor data and anomaly detection

A common use case for streaming data is the analysis of sensor data. Sensor data can occur in a multitude of use cases, such as industry production lines and IoT use cases. When companies decide to collect sensor data, it is often treated in real time.

For a production line, there is great value in detecting anomalies in real time. When too many anomalies occur, the production line can be shut down or the problem can be solved before a number of faulty products are delivered.

A good example of streaming analytics for monitoring humidity for artwork can be found here: https://azure.github.io/iot-workshop-asset-tracking/step-003-anomaly-detection/.
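To give a feel for what real-time anomaly detection on sensor data can look like, here is a minimal sketch that flags values far away from a running mean. It is one simple technique among many (a z-score on statistics maintained with Welford's algorithm) and is not the method used in the linked workshop. The class name, threshold, warm-up length, and readings are all assumptions for illustration.

```python
import math


class RunningZScoreDetector:
    """Flags values far from the running mean, one record at a time."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # z-score above which a value is flagged
        self.warmup = warmup        # records to observe before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def score(self, value: float) -> float:
        """Absolute z-score of value against the statistics seen so far."""
        if self.count < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def update(self, value: float) -> None:
        """Welford's algorithm: update mean and variance incrementally."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def process(self, value: float) -> bool:
        """Score the new value first, then learn from it."""
        is_anomaly = (
            self.count >= self.warmup and self.score(value) > self.threshold
        )
        self.update(value)
        return is_anomaly


# Hypothetical humidity readings with one obvious outlier.
readings = [50.0, 51.0, 49.0, 50.5, 49.5, 50.0, 80.0, 50.2]
detector = RunningZScoreDetector()
flags = [detector.process(value) for value in readings]
print(flags)
```

Note that this sketch also learns from values it has flagged, which inflates the running variance afterwards. Whether to exclude anomalies from the update is a design choice that depends on the use case.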

Finance and regression forecasting

Finance data is another great use case for streaming data. For example, in the world of stock trading, timing is important. The faster you can detect upward or downward trends in the stock market, the faster a trader (or an algorithm) can react by selling or buying stocks and making money.

A great example is described in the following paper by K. S. Umadevi et al. (2018): https://ieeexplore.ieee.org/document/8554561.
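As a much simpler illustration than the approach in the paper, trend detection on a price stream can be sketched by comparing a fast and a slow exponential moving average, each updated per record. This is a basic indicator rather than a regression model, and the smoothing factors and prices below are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, Optional


def ema_trend(
    prices: Iterable[float],
    fast_alpha: float = 0.5,
    slow_alpha: float = 0.1,
) -> Iterator[str]:
    """Yield 'up', 'down', or 'flat' after every new price."""
    fast: Optional[float] = None
    slow: Optional[float] = None
    for price in prices:
        if fast is None or slow is None:
            # The first record initializes both averages.
            fast = slow = price
        else:
            # The fast average reacts quickly, the slow one lags behind.
            fast = fast_alpha * price + (1 - fast_alpha) * fast
            slow = slow_alpha * price + (1 - slow_alpha) * slow
        if fast > slow:
            yield "up"
        elif fast < slow:
            yield "down"
        else:
            yield "flat"


# Hypothetical price stream: a rise followed by a fall.
prices = [100.0, 101.0, 102.0, 103.0, 100.0, 97.0, 94.0]
signals = list(ema_trend(prices))
print(signals)
```

The signal switches to "down" one record after the fall begins, which shows the lag that any smoothing-based indicator introduces.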

Clickstream for websites and classification

Websites or apps are a third common use case for real-time insights. If you can track and analyze your visitors in real time, you can propose a personalized experience for them on your website. By proposing products or services that match a website visitor's interests, you can increase your online sales.

The following paper by Ramanna Hanamanthrao and S Thejaswini (2017) gives a great use case for this technology applied to clickstream data: https://ieeexplore.ieee.org/abstract/document/8256978.
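Classification on clickstream data can also be learned one record at a time. The sketch below is a small logistic regression trained with stochastic gradient descent, written in plain Python. It is not the method from the paper, and the `viewed_pricing` feature, the labels, and the learning rate are hypothetical. Libraries such as River offer ready-made online models for this purpose.

```python
import math
from typing import Dict


class OnlineLogisticRegression:
    """Logistic regression updated with one SGD step per record."""

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.learning_rate = learning_rate
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def predict_proba(self, features: Dict[str, float]) -> float:
        """Estimated probability that the visitor converts."""
        z = self.bias + sum(
            self.weights.get(name, 0.0) * value
            for name, value in features.items()
        )
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, features: Dict[str, float], label: int) -> None:
        """Gradient step on the log loss for a single record."""
        error = self.predict_proba(features) - label
        self.bias -= self.learning_rate * error
        for name, value in features.items():
            current = self.weights.get(name, 0.0)
            self.weights[name] = current - self.learning_rate * error * value


model = OnlineLogisticRegression()

# Hypothetical stream: visitors who viewed the pricing page converted,
# the others did not.
for _ in range(100):
    model.learn_one({"viewed_pricing": 1.0}, 1)
    model.learn_one({"viewed_pricing": 0.0}, 0)

p_viewed = model.predict_proba({"viewed_pricing": 1.0})
p_not_viewed = model.predict_proba({"viewed_pricing": 0.0})
print(p_viewed > 0.5, p_not_viewed < 0.5)
```

The model never sees the full data set at once. Each record adjusts the weights slightly, so predictions are available and improving while the stream is still running.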

Streaming versus big data

It is important to understand different definitions of streaming that you may encounter. One distinction to make is between streaming and big data. Some definitions will consider streaming mainly in a big data (Hadoop/Spark) context, whereas others do not.

Streaming solutions often have a large volume of data, and big data solutions can be the appropriate choice. However, other technologies, combined with a well-chosen hardware architecture, may also be able to do the analytics in real time and, therefore, build streaming solutions without big data technologies.

Streaming versus real-time inference

Real-time inference of models is often built and made accessible via an API. As we define streaming as the analysis of data in real time without batches, such predictions in real time can be considered streaming. You will see more about real-time architectures in a later chapter.
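As a rough sketch of what sits behind such an API, the following function scores a single JSON record and returns a JSON response. The web framework is left out on purpose. The model weights, feature names, and the `handle_request` function are hypothetical stand-ins for an already-trained model.

```python
import json

# Hypothetical, already-trained linear model.
WEIGHTS = {"temperature": 0.8, "humidity": -0.2}
BIAS = 1.0


def handle_request(body: str) -> str:
    """Score one JSON record and return the prediction as JSON."""
    record = json.loads(body)
    prediction = BIAS + sum(
        weight * record[name] for name, weight in WEIGHTS.items()
    )
    return json.dumps({"prediction": prediction})


# One incoming record, as an API endpoint would receive it.
response = handle_request('{"temperature": 10.0, "humidity": 5.0}')
result = json.loads(response)
print(result)
```

Each call handles exactly one record, which is what makes this style of inference fit the definition of streaming used here.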