Practical Real-time Data Processing and Analytics

Overview of this book

With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible. This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you’ll be equipped with a clear understanding of how to solve challenges on your own. We’ll cover topics such as how to set up components, basic executions, integrations, advanced use cases, alerts, and monitoring. You’ll be exposed to the popular tools used in real-time processing today such as Apache Spark, Apache Flink, and Storm. Finally, you will put your knowledge to practical use by implementing all of the techniques in the form of a practical, real-world use case. By the end of this book, you will have a solid understanding of all the aspects of real-time data processing and analytics, and will know how to deploy the solutions in production environments in the best possible manner.

IoT – thoughts and possibilities


The Internet of Things (IoT), a term coined in 1999 by Kevin Ashton, has become one of the most promising door openers of the decade. Although IoT had precursors in the form of M2M and instrumentation control for industrial automation, the scale at which IoT and the era of connected smart devices have arrived is unprecedented. The following figure gives a bird's-eye view of the vastness and variety of IoT applications:

We are all surrounded by devices that are smart and connected; they have the ability to sense, process, transmit, and even act, based on their processing. The age of machines that was science fiction a few years ago has become reality. I have connected vehicles that sense my keys and unlock or lock as I walk towards or away from them. I have proximity-sensing beacons in my supermarkets that detect how close I am to a shelf and flash offers to my cell phone. I have smart ACs that regulate the temperature based on the number of people in the room. My smart offices save electricity by switching off the lights and ACs in empty conference rooms. The list seems to be endless and growing every second.

At its heart, IoT is nothing but an ecosystem of connected devices that have the ability to communicate over the internet. Here, devices/things could be anything: a sensor device, a person with a wearable, a place, a plant, an animal, a machine. Virtually any physical item you can think of on this planet can be connected today. There are predominantly seven layers to any IoT platform; these are depicted and described in the following figure:

The following is a quick description of all seven IoT application layers:

  • Layer 1: Devices, sensors, controllers and so on
  • Layer 2: Communication channels, network protocols, network elements, and the communication and routing hardware (telecom, Wi-Fi, and satellite)
  • Layer 3: Infrastructure, which could be in-house or on the cloud (public, private, or hybrid)
  • Layer 4: The big data ingestion layer, the landing platform where the data from things/devices is collected for the next steps
  • Layer 5: The processing engine that does the cleansing, parsing, massaging, and analysis of the data using complex processing, machine learning, artificial intelligence, and so on, to generate insights in the form of reports, alerts, and notifications
  • Layer 6: Custom apps and pluggable secondary interfaces such as visualization dashboards, downstream applications, and so on
  • Layer 7: The people and processes that actually act on the insights and recommendations generated by the preceding layers

At an architectural level, a basic reference architecture of an IoT application is depicted in the following image:

In the previous figure, if we take a bottom-up approach, the lowest layer consists of devices that are sensors, or sensors powered by computational units like Raspberry Pi or Arduino. At this point, communication and data transfer are generally governed by lightweight options like Message Queuing Telemetry Transport (MQTT) and Constrained Application Protocol (CoAP), which are fast replacing legacy options like HTTP. This layer works in conjunction with the aggregation or bus layer, which is essentially a Mosquitto broker and forms the event transfer layer from the source, that is, from the device, to the processing hub. Once we reach the processing hub, all the data is available at the compute engine, ready to be put to work, and we can analyze and process it to generate useful, actionable insights. These insights are then exposed through web service APIs that downstream applications can consume. Apart from these horizontal layers, there are cross-cutting layers that handle device provisioning, device management, and identity and access management.
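To make the flow from device to bus to processing hub concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `InMemoryBroker` is a hypothetical in-process stand-in for the bus layer (it is not Mosquitto and does not speak MQTT), and `ProcessingHub`, the topic name, and the alert threshold are names invented for this example.

```python
from statistics import mean


class InMemoryBroker:
    """Hypothetical in-process stand-in for the aggregation/bus layer."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Deliver the event to every subscriber registered for this topic.
        for callback in self._subscribers.get(topic, []):
            callback(topic, message)


class ProcessingHub:
    """Hypothetical compute engine that turns raw events into insights."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.readings = []
        self.alerts = []

    def on_message(self, topic, message):
        value = message["value"]
        self.readings.append(value)
        if value > self.threshold:
            self.alerts.append(f"{message['device']} reported {value} on {topic}")

    def insights(self):
        # The kind of summary a web service API layer could expose downstream.
        return {
            "count": len(self.readings),
            "mean": mean(self.readings),
            "alerts": list(self.alerts),
        }


def run_demo():
    broker = InMemoryBroker()
    hub = ProcessingHub(threshold=30.0)
    broker.subscribe("sensors/temperature", hub.on_message)
    # Devices at the lowest layer publish their readings to the bus.
    for device, value in [("dev-1", 21.0), ("dev-2", 35.0), ("dev-1", 22.0)]:
        broker.publish("sensors/temperature", {"device": device, "value": value})
    return hub.insights()


if __name__ == "__main__":
    print(run_demo())
```

In a real deployment the broker would be a separate networked service and the hub a distributed compute engine; the sketch only shows how the layers hand events to one another.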

Now that we understand the high-level architecture and layers of a standard IoT application, the next step is to understand the key aspects in which an IoT solution is constrained and what the implications are for the overall solution design:

  • Security: This is a key concern for the entire data-driven solution segment, but the combination of big data and internet-connected devices makes the system even more susceptible to attack. Security is therefore a strategic concern that must be addressed at every layer of the solution design, for data at rest and in motion.
  • Power consumption/battery life: We are devising solutions for devices and not human beings; thus, the solutions we design should have very low overall power consumption and should not drain battery life.
  • Connectivity and communication: Unlike humans, devices are always connected and can be very chatty. Here again, we need a lightweight protocol that supports low-latency data transfer.
  • Recovery from failures: These solutions are designed to process billions of data points and to run in a self-sustaining 24/7 mode. The solution should be built with the capability to diagnose failures, apply back pressure, and then self-recover with minimal data loss. Today, IoT solutions are being designed to handle sudden spikes of data by detecting latency or bottlenecks and scaling up and down elastically.
  • Scalability: The solutions need to be designed so that they are linearly scalable without the need to re-architect the base framework or design. The reason is that this domain is exploding, with an unprecedented and unpredictable number of devices being connected and a whole plethora of future use cases waiting to happen.
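The back pressure and elastic scaling ideas in the list above can be sketched with a bounded buffer. This is a simplified, single-process illustration, not a production design: `BoundedIngestBuffer`, `should_scale_up`, and the 0.8 high watermark are hypothetical names and values chosen for the example. A rejected `offer` is the back pressure signal telling the producer to slow down or retry, and buffer utilisation is used as a crude bottleneck indicator.

```python
import queue


class BoundedIngestBuffer:
    """Hypothetical ingestion buffer that pushes back when it is full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, event):
        """Try to enqueue an event; False tells the producer to back off."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_items):
        """Remove and return up to max_items events in arrival order."""
        items = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items

    def utilisation(self):
        """Fraction of the buffer currently occupied, between 0.0 and 1.0."""
        return self._queue.qsize() / self._capacity


def should_scale_up(buffer, high_watermark=0.8):
    """Treat a nearly full buffer as a sign that more consumers are needed."""
    return buffer.utilisation() >= high_watermark


if __name__ == "__main__":
    buffer = BoundedIngestBuffer(capacity=5)
    for event_id in range(7):
        if not buffer.offer(event_id):
            print(f"event {event_id} rejected, producer should slow down")
    print("scale up?", should_scale_up(buffer))
    buffer.drain(4)
    print("scale up after draining?", should_scale_up(buffer))
```

Real systems usually propagate back pressure across the network and make scaling decisions from richer metrics, but the same two signals (a full buffer and a high utilisation) are at the core of the idea.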

These constraints have implications for the IoT application framework, which surface in the choice of communication channels, communication protocols, and processing adapters.

In terms of communication channel providers, the IoT ecosystem is evolving from telecom channels and LTE to options like:

  • Direct Ethernet/Wi-Fi/3G
  • LoRa
  • Bluetooth Low Energy (BLE)
  • RFID/Near Field Communication (NFC)
  • Medium-range radio mesh networks like Zigbee

For communication protocols, the de facto standard as of now is MQTT, and the reasons for its wide usage are evident:

  • It is extremely lightweight
  • It has a very low footprint in terms of network utilization, thus making communication fast and less taxing
  • It comes with a guaranteed delivery mechanism, ensuring that the data will eventually be delivered, even over fragile networks
  • It has low power consumption
  • It optimizes the flow of data packets over the wire to achieve low latency and a smaller footprint
  • It is a bi-directional protocol, and thus is suited both for transferring data from the device and for transferring data to the device
  • It is better suited to situations in which we have to transmit a high volume of short messages over the wire
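To see why MQTT is considered lightweight, it helps to look at how few bytes a message needs on the wire. The following sketch hand-encodes a PUBLISH packet following the MQTT 3.1.1 packet layout (fixed header, variable-length "remaining length", length-prefixed topic, optional packet identifier, payload). It is meant for illustration only: the function names and the topic are made up for this example, and a real application would use an MQTT client library rather than building packets by hand.

```python
MAX_REMAINING_LENGTH = 268_435_455  # largest value the 4-byte encoding allows


def encode_remaining_length(length):
    """Encode a length as 1 to 4 bytes, 7 bits per byte plus a continuation bit."""
    if not 0 <= length <= MAX_REMAINING_LENGTH:
        raise ValueError("remaining length out of range")
    encoded = bytearray()
    while True:
        digit = length % 128
        length //= 128
        if length > 0:
            digit |= 0x80  # more bytes follow
        encoded.append(digit)
        if length == 0:
            break
    return bytes(encoded)


def encode_publish(topic, payload, qos=0, packet_id=None, retain=False):
    """Build the bytes of an MQTT 3.1.1 PUBLISH packet (DUP flag left unset)."""
    if qos not in (0, 1, 2):
        raise ValueError("qos must be 0, 1, or 2")
    topic_bytes = topic.encode("utf-8")
    variable_header = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    if qos > 0:
        # QoS 1 and 2 carry a packet identifier so delivery can be acknowledged.
        if packet_id is None or not 1 <= packet_id <= 65535:
            raise ValueError("qos > 0 requires a packet_id between 1 and 65535")
        variable_header += packet_id.to_bytes(2, "big")
    first_byte = (3 << 4) | (qos << 1) | int(retain)  # 3 is the PUBLISH type
    body = variable_header + payload
    return bytes([first_byte]) + encode_remaining_length(len(body)) + body


if __name__ == "__main__":
    packet = encode_publish("sensors/t1", b"21.5")
    overhead = len(packet) - len("sensors/t1") - len(b"21.5")
    print(f"packet is {len(packet)} bytes, of which {overhead} are protocol overhead")
```

For a short reading published at QoS 0, the protocol adds only a handful of bytes on top of the topic and the payload, which is what makes MQTT attractive for high volumes of short messages.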

Edge analytics

Following the evolution of IoT and the IoT revolution, edge analytics is another significant game changer. In IoT applications, the data from the sensors and devices needs to be collated and has to travel all the way to the distributed processing unit, which is either on the premises or on the cloud. This lift and shift of data leads to significant network utilization and makes the overall solution prone to transmission delays.

These considerations led to the development of a new kind of solution and, in turn, a new arena of IoT computation. The term is edge analytics and, as the name suggests, it's all about pushing the processing to the edge, so that the data is processed at its source.

The following figure shows the bifurcation of IoT into:

  • Edge analytics
  • Core analytics

As depicted in the previous figure, IoT computation is now divided into three segments, as follows:

  • Sensor-level edge analytics: Data is processed and some insights are derived at the device level itself
  • Edge analytics: Data is processed and insights are derived at the gateway level
  • Core analytics: This flavor of analytics requires all data to arrive at a common compute engine (distributed storage and distributed computation), where the high-complexity processing is done to derive actionable insights for people and processes
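The three segments above can be illustrated with a small sketch in which each tier reduces the data before passing it on. The function names, the summary fields, and the alert threshold are all assumptions made for this example; real sensor-level and gateway-level analytics depend heavily on the hardware and the use case.

```python
from statistics import mean


def summarise_on_device(device_id, readings, alert_above):
    """Sensor-level edge analytics: reduce a window of raw readings to a summary."""
    if not readings:
        raise ValueError("readings must not be empty")
    return {
        "device": device_id,
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "alert": max(readings) > alert_above,  # insight derived on the device
    }


def aggregate_at_gateway(summaries):
    """Edge analytics at the gateway: combine device summaries into one message."""
    if not summaries:
        raise ValueError("summaries must not be empty")
    total = sum(s["count"] for s in summaries)
    # Weight each device mean by its reading count to recover the overall mean.
    overall_mean = sum(s["mean"] * s["count"] for s in summaries) / total
    return {
        "devices": len(summaries),
        "count": total,
        "mean": overall_mean,
        "alerting_devices": sorted(s["device"] for s in summaries if s["alert"]),
    }


if __name__ == "__main__":
    raw = {"dev-1": [20.0, 21.0, 22.0], "dev-2": [30.0, 40.0]}
    summaries = [
        summarise_on_device(device, values, alert_above=35.0)
        for device, values in raw.items()
    ]
    # Only this single aggregate travels on to the core analytics engine.
    print(aggregate_at_gateway(summaries))
```

In this sketch, five raw readings become two device summaries and then one gateway message, which is the network saving that motivates edge analytics; core analytics would then run the heavier processing on such aggregates, or on the raw data when it is needed.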

Some of the typical use cases for sensor/edge analytics are:

  • Industrial IoT (IIoT): The sensors are embedded in various pieces of equipment, machinery, and sometimes even shop floors. The sensors generate data, and the devices have the innate capability to process the data and generate alerts/recommendations to improve performance or yield.
  • IoT in health care: Smart devices can participate in edge processing, contribute to the detection of early warning signs, and raise alerts for appropriate medical situations.
  • Wearable devices: Edge processing can make tracking and safety very easy.

Today, if I look around, my world is surrounded by connected devices, like my smart AC, my smart refrigerator, and my smart TV; they all send data to a central hub or my mobile phone and are easily controllable from there. Now, things are actually getting smart; they are evolving from being connected to being smart enough to compute, process, and predict. For instance, my coffee maker is smart enough to be connected to my car, the traffic, and my office timings, so that it predicts my daily routine and my arrival time and has hot, fresh coffee ready the moment I need it.