Analytics for the Internet of Things (IoT)

By Andrew Minteer

Overview of this book

We start with the perplexing task of extracting value from huge amounts of barely intelligible data. The data takes a convoluted route just to be on the servers for analysis, but insights can emerge through visualization and statistical modeling techniques. You will learn to extract value from IoT big data using multiple analytic techniques. Next we review how IoT devices generate data and how the information travels over networks. You’ll get to know strategies to collect and store the data to optimize the potential for analytics, and strategies to handle data quality concerns. Cloud resources are a great match for IoT analytics, so Amazon Web Services, Microsoft Azure, and PTC ThingWorx are reviewed in detail next. Geospatial analytics is then introduced as a way to leverage location information. Combining IoT data with environmental data is also discussed as a way to enhance predictive capability. We’ll also review the economics of IoT analytics and you’ll discover ways to optimize business value. By the end of the book, you’ll know how to handle scale for both data storage and analytics, how Apache Spark can be leveraged to handle scalability, and how R and Python can be used for analytic modeling.

IoT analytics challenges


There are some special challenges that come with IoT data. The data is created by devices operating remotely, sometimes in widely varying environmental conditions that can change from day to day, and the devices themselves are often widely distributed geographically.

The data is communicated over long distances, often across different networking technologies. It is very common for data to first travel across a wireless network, then pass through a gateway device, and finally cross the public internet, which itself combines multiple types of networking technology.

The data volume

A company can easily have thousands to millions of IoT devices with several sensors on each unit, each sensor reporting values on a regular basis. The inflow of data can grow quite large very quickly. Since IoT devices send data on an ongoing basis, the volume of data in total can increase much faster than many companies are used to.

To demonstrate how this can happen, imagine a company that manufactures small monitoring devices. It produces 12,000 devices a year, starting in 2010 when the product was launched. Each one is tested at the end of assembly and the values reported by the sensors on the device are kept for analysis for five years. The data growth looks like the following image:

A chart showing data storage needs, assuming a 200 KB production snapshot per unit and 1,000 units produced per month; five years of production data is kept

Now, imagine the device also has internet connectivity to track sensor values, and each one remains connected for two years. Since the data inflow continues well after the devices are built, data growth accelerates sharply and only levels off once older devices stop reporting values. This looks more like the blue area in the following chart:

A chart showing the addition of IoT data at 0.5 KB per message and 10 messages per day; devices remain connected for two years from production

In order to illustrate how large this can get, consider the following example. If you capture 10 messages per day and the message size is half of a full production snapshot (100 KB), by 2017, data storage requirements would be over 1,500 times higher than for production-only data.
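
To make the arithmetic concrete, here is a minimal back-of-the-envelope Python sketch of the storage model described above. The production figures come from the example; the 30-day month is a simplifying assumption, and the message size is a parameter because IoT storage scales linearly with it:

    SNAPSHOT_KB = 200          # one end-of-line test snapshot per unit
    UNITS_PER_MONTH = 1_000
    RETENTION_MONTHS = 5 * 12  # production data kept for five years
    CONNECTED_MONTHS = 2 * 12  # each device reports for two years
    MSGS_PER_DAY = 10

    def storage_gb(months: int, msg_kb: float) -> tuple[float, float]:
        """Approximate (production, IoT) storage in GB, `months` after launch."""
        prod_gb = (min(months, RETENTION_MONTHS) * UNITS_PER_MONTH
                   * SNAPSHOT_KB / 1024 ** 2)
        # Each monthly cohort of 1,000 units reports for up to CONNECTED_MONTHS.
        device_months = UNITS_PER_MONTH * sum(
            min(m, CONNECTED_MONTHS) for m in range(1, months + 1))
        iot_gb = device_months * 30 * MSGS_PER_DAY * msg_kb / 1024 ** 2  # ~30-day months
        return prod_gb, iot_gb

    # The chart's 0.5 KB messages versus the larger 100 KB example:
    for msg_kb in (0.5, 100):
        prod, iot = storage_gb((2017 - 2010) * 12, msg_kb)
        print(f"{msg_kb} KB messages: production {prod:,.0f} GB, IoT {iot:,.0f} GB")

Even at the chart's 0.5 KB message size, the accumulated IoT data dwarfs five years of production snapshots; at 100 KB per message, it is larger by orders of magnitude.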

For many companies, this introduces some problems. The database software, storage infrastructure, and available computing horsepower are not typically designed to handle this kind of growth. Licensing agreements with software vendors tend to be tied to the number of servers and CPU cores, and storage is handled by standard backup planning and retention policies.

The data volume rapidly pushes computing and storage requirements well beyond what a single server can hold, and under traditional architectures it quickly becomes cost prohibitive to distribute the load across hundreds or thousands of servers. Good analytics also requires lots of historical data, and since you are unlikely to know ahead of time which data is most predictive, you have to keep as much of it on hand as you can.

With large-scale data, the computing horsepower required for analytics is not very predictable and changes dramatically with the question being asked: analytic needs are highly elastic. Traditional capacity planning ratchets up on-premises resources, with the number of servers needed to meet peak demand determined in advance. Doubling compute power on short notice, if it is possible at all, is very expensive.

IoT data volumes and computing resource requirements can quickly outpace all the other company data needs combined.

Problems with time

The only reason for time is so that everything doesn't happen at once.

– Albert Einstein

Time is tightly tied to geographic position and the date on the calendar. The international standard for tracking a common time is Coordinated Universal Time (UTC). UTC is anchored to 0° longitude, which passes through Greenwich, England, in the UK. Although it is tied to that location, it is not the same as Greenwich Mean Time (GMT): GMT is a time zone, while UTC is a time standard. UTC does not observe Daylight Saving Time (DST):

Standard time zones of the world. Source: CIA Factbook

When data used for analytics is recorded at a headquarters or a manufacturing plant, everything happens in the same place and time zone. IoT devices, by contrast, are spread across the globe, so events that happen at the same absolute time do not happen at the same local time. How time is recorded therefore affects the integrity of the resulting analytics.

When IoT devices communicate sensor data, time may be captured as local time, and it can dramatically affect analytics results if it is not clear whether local time or UTC was recorded. For example, consider an analyst working at a company that makes parking spot occupancy sensors. She is tasked with creating predictive models to estimate future parking lot fill rates, and the time of day is likely to be a very predictive data point. How that time was recorded makes a big difference to her; without knowing the convention, even determining whether it was night or day at the sensor location is difficult.

This may not be apparent to the engineer creating the device. His task is to design a device that determines whether the spot is open or not, and he may not appreciate the importance of writing code that captures a time value that can be aggregated across multiple time zones and locations.
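
One way to avoid this is for device code to record the unambiguous UTC instant along with the device's IANA time zone name, so values can be aggregated globally and still converted back to local time. The following Python sketch is purely illustrative; the field names are assumptions, not a published schema:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    def make_reading(occupied: bool, device_tz: str) -> dict:
        """Record the unambiguous UTC instant plus the device's IANA zone."""
        return {
            "occupied": occupied,
            "event_time_utc": datetime.now(timezone.utc).isoformat(),
            "device_tz": device_tz,  # for example, "America/Chicago"
        }

    reading = make_reading(True, "America/Chicago")
    # An analyst can always recover local time (and night versus day) later:
    local = datetime.fromisoformat(reading["event_time_utc"]).astimezone(
        ZoneInfo(reading["device_tz"]))
    print(reading["event_time_utc"], "->", local.isoformat())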

There can also be issues with clock synchronization. Devices set their internal clocks by syncing against a time source. If local time is used, a configuration error can leave the device in the wrong time zone, and a communication problem with the time source can let the clock drift out of sync.
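
On the analytics side, a simple defensive check is to flag readings whose device timestamp disagrees with the server's receive time by more than some tolerance. This sketch assumes a five-minute threshold, which would need to exceed the normal transmission delay of the network:

    from datetime import datetime, timedelta, timezone

    MAX_SKEW = timedelta(minutes=5)  # assumed tolerance; must exceed normal latency

    def clock_suspect(device_ts: datetime, received_ts: datetime) -> bool:
        """Flag readings whose device clock disagrees with the server's
        receive time by more than the allowed skew (both UTC-aware)."""
        return abs(received_ts - device_ts) > MAX_SKEW

    sent = datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc)
    received = datetime(2017, 3, 1, 13, 2, tzinfo=timezone.utc)
    print(clock_suspect(sent, received))  # True: over an hour of apparent skew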

If local time is being used, daylight saving time can cause problems. How will events that happen between 1 a.m. and 2 a.m. on the day clocks fall back in autumn be recorded, given that that hour happens twice? The laws that determine which days mark daylight saving time can also change, as they did in Turkey when DST was scrapped in September 2016. If the device is locked into a fixed date range at the time of manufacture, its time would be incorrect for several days out of the year after the DST rules change.
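
Python's standard library shows the ambiguity directly: the fold attribute (PEP 495) distinguishes the first and second occurrences of the repeated hour:

    from datetime import datetime
    from zoneinfo import ZoneInfo

    tz = ZoneInfo("America/New_York")
    # 1:30 a.m. on 2016-11-06 happened twice in this zone; `fold` says which.
    first = datetime(2016, 11, 6, 1, 30, tzinfo=tz)            # still on EDT
    second = datetime(2016, 11, 6, 1, 30, fold=1, tzinfo=tz)   # after fall-back, EST
    print(first.isoformat())   # 2016-11-06T01:30:00-04:00
    print(second.isoformat())  # 2016-11-06T01:30:00-05:00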

How daylight saving time changes also differs from country to country. In the United States, clocks change at 02:00 local time in each time zone. In the European Union, the change is coordinated so that all EU countries switch at 01:00 GMT at once. This keeps the offsets between time zones constant, at the expense of the change happening at a different local time in each zone.

In early 2008, central Brazil was one, two, or three hours ahead of the eastern U.S., depending on the date. Source: Wikimedia Commons

When the time of an event, such as a parking spot being vacated, is recorded, analytics needs that time to be as close to the actual occurrence as possible. In practice, though, the time available for analytics may be the time the event occurred, the time the IoT device sent the data, the time the data was received, or the time the data was added to your data warehouse.
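
A record layout along the following lines, shown here as an illustrative Python dataclass rather than a prescribed schema, keeps every timestamp the pipeline can observe so analytics can use the one closest to the actual event:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class ParkingEvent:
        """Illustrative record: keep every timestamp the pipeline observes."""
        sent_time_utc: datetime                     # when the device transmitted
        received_time_utc: datetime                 # when the server got it
        loaded_time_utc: datetime                   # when it reached the warehouse
        event_time_utc: Optional[datetime] = None   # when the spot was vacated

        def best_time(self) -> datetime:
            # Prefer the timestamp closest to the physical event.
            return self.event_time_utc or self.sent_time_utc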

Problems with space

IoT devices are located in multiple geographic locations. Different areas of the world have different environmental conditions. Temperature variations can affect sensor accuracy. You could have less accurate readings in Calgary, Canada than in Cancun, Mexico, if cold impacts your device.

Elevation can affect equipment such as diesel engines. If location and elevation are not taken into consideration, you may falsely conclude from IoT sensor readings that a Denver-based fleet of delivery trucks is managing fuel economy poorly compared to a fleet in Indiana. Lots of mountain roads can burn up some fuel!

US elevation profile from LA to NYC. Source: reddit.com
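
One hedge against this kind of false conclusion is to join telemetry to site metadata before comparing fleets, so elevation becomes an explicit covariate instead of a hidden confounder. The numbers in this pandas sketch are made up for illustration:

    import pandas as pd

    # Made-up numbers for illustration only.
    fuel = pd.DataFrame({
        "truck_id": [1, 2, 3, 4],
        "fleet": ["Denver", "Denver", "Indiana", "Indiana"],
        "mpg": [6.1, 6.4, 7.8, 7.5],
    })
    sites = pd.DataFrame({
        "fleet": ["Denver", "Indiana"],
        "elevation_m": [1609, 230],
    })
    # Join site metadata so elevation is visible alongside fuel economy,
    # instead of silently confounding a raw fleet-versus-fleet comparison.
    df = fuel.merge(sites, on="fleet")
    print(df.groupby(["fleet", "elevation_m"])["mpg"].mean())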

Remote locations may have weaker network access, and the higher data loss can cause values from those locations to be underrepresented in the resulting analytics.

Many IoT devices are solar powered. The available battery charge can affect the frequency of data reporting. A device in Portland, Oregon, where it is often cloudy and rainy will be more impacted than the same device in Phoenix, Arizona, where it is mostly sunny.

There are also political considerations related to the location of the IoT device. Privacy laws in Europe affect how the data from devices can be stored and what type of analytics is acceptable. You may be required to anonymize the data from certain countries, which can affect what you can do with analytics.

Data quality

Constrained devices mean lossy networks, which for analytics often results in missing or inconsistent data. The missing data is often not random; as mentioned previously, it can be affected by location. Devices run software, called firmware, which may not be consistent across locations. That can mean differences in reporting frequency or in the formatting of values, and it can result in lost or mangled data.

Data messages from IoT devices often require the destination to know how to interpret the message being sent. Software bugs can lead to garbled messages and data records.

Messages lost in transit, or never sent because of dead batteries, result in missing values. Conserving power often means that not all values available on the device are sent at the same time, so the resulting datasets have gaps: the device reports some values consistently every time and others less frequently.
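
The resulting tables look something like the following pandas sketch (with invented values), where one sensor value arrives with every message and another only occasionally:

    import pandas as pd

    # Illustration: battery voltage arrives with every message, while
    # temperature is sent only occasionally to conserve power.
    msgs = pd.DataFrame({
        "ts": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 00:10",
                              "2017-01-01 00:20", "2017-01-01 00:30"]),
        "battery_v": [3.9, 3.9, 3.8, 3.8],
        "temp_c": [21.5, None, None, 20.9],
    })
    print(msgs.isna().sum())  # the gaps in temp_c are by design, not failures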

Analytics challenges

Analytics often requires deciding whether to fill in or ignore missing values. Either choice can produce a dataset that is not representative of reality.
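
A small illustration of why the choice matters: dropping and interpolating the same gappy series yield different summary statistics, and neither is guaranteed to match what actually happened at the device:

    import pandas as pd

    temps = pd.Series([21.5, None, None, 20.9, None, 18.2])

    dropped = temps.dropna()       # ignore the gaps: fewer, possibly biased rows
    filled = temps.interpolate()   # fill the gaps: invented, possibly wrong values
    print(dropped.mean())          # 20.2
    print(filled.mean())           # ~20.4; the two choices disagree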

As an example of how this can affect results, consider the inaccurate political poll results of recent years. Many experts believe polling is now in a near crisis because much of the world has shifted to mobile numbers as their only phone numbers. It is cheaper and easier for pollsters to reach people on landlines, which can lead to an overrepresentation of landline owners, who tend to be both older and wealthier than mobile-only respondents.

Response rates have also dropped, from near 80% in the 1970s to about 8% (if you are lucky) today. This makes it more difficult (and more expensive) to obtain a representative sample, leading to many embarrassingly wrong poll predictions.

There can also be outside influences, such as environmental conditions, that are not captured in the data. Winter storms can cause power failures that determine which devices are able to report data at all. You may end up drawing conclusions from a non-representative sample of data without realizing it. This can affect the results of IoT analytics, and it will not be clear why.

Since connectivity is a new thing for many devices, there is also often a lack of historical data on which to base predictive models. This can limit the type of analytics that can be done with the data.

It can also lead to a recency bias in datasets, as newer products are overrepresented in the data simply because a higher percentage of them are now part of the IoT.

This leads us to the author's number one rule in IoT analytics:

Never trust data you don't know.

Treat it like a stranger offering you candy.