Time Series Analysis on AWS

By: Michaël Hoarau

Overview of this book

As a business analyst or data scientist, you have to use many algorithms and approaches to prepare, process, and build ML-based applications by leveraging time series data, but you face common problems, such as not knowing which algorithm to choose or how to combine and interpret them. Amazon Web Services (AWS) provides numerous services to help you build applications fueled by artificial intelligence (AI) capabilities. This book helps you get to grips with three AWS AI/ML-managed services to enable you to deliver your desired business outcomes. The book begins with Amazon Forecast, where you’ll discover how to use time series forecasting, leveraging sophisticated statistical and machine learning algorithms to deliver business outcomes accurately. You’ll then learn to use Amazon Lookout for Equipment to build multivariate time series anomaly detection models geared toward industrial equipment and understand how it provides valuable insights to reinforce teams focused on predictive maintenance and predictive quality use cases. In the last chapters, you’ll explore Amazon Lookout for Metrics to automatically detect and diagnose outliers in your business and operational data. By the end of this AWS book, you’ll understand how to use the three AWS AI services effectively to perform time series analysis.
Table of Contents (20 chapters)

  • Section 1: Analyzing Time Series and Delivering Highly Accurate Forecasts with Amazon Forecast
  • Section 2: Detecting Abnormal Behavior in Multivariate Time Series with Amazon Lookout for Equipment
  • Section 3: Detecting Anomalies in Business Metrics with Amazon Lookout for Metrics

Recognizing the different families of time series

In this section, you will become familiar with the different families of time series. For any ML practitioner, it is obvious that a single image should not be processed like a video stream and that detecting an anomaly on an image requires a high enough resolution to capture the said anomaly. Multiple images from a certain subject (for example, pictures of a cauliflower) would not be very useful to teach an ML system anything about the visual characteristics of a pumpkin—or an aircraft, for that matter. As eyesight is one of our human senses, this may be obvious. However, we will see in this section and the following one (dedicated to challenges specific to time series) that the same kinds of differences apply to different time series.

There are four different families involved in time series data, which are outlined here:

  • Univariate time series data
  • Continuous multivariate data
  • Event-based multivariate data
  • Multiple time series data

Univariate time series data

A univariate time series is a sequence of single time-dependent values.

Such a series could be the energy output in kilowatt-hour (kWh) of a power plant, the closing price of a single stock, or the daily average temperature measured in Paris, France.

The following screenshot shows an excerpt of the energy consumption of a household:

Figure 1.1 – First rows and line plot of a univariate time series capturing the energy consumption of a household

A univariate time series can be discrete: for instance, you may be limited to the single daily value of stock market closing prices. In this situation, if you wanted to have a higher resolution (say, hourly data), you would end up with the same value duplicated 24 times per day.

Temperature seems to be closer to a continuous variable, for that matter: you can get a reading as frequently as you would wish, and you can expect some level of variation whenever you have a data point. You are, however, limited to the frequency at which the temperature sensor takes its reading (every 5 minutes for a home meteorological station or every hour in main meteorological stations). For practical purposes, most time series are indeed discrete, hence the definition called out earlier in this chapter.
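The resolution point made above about daily closing prices can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the prices, the dates, and the helper name `upsample_to_hourly` are hypothetical and chosen only for the example.

```python
from datetime import date, datetime, timedelta

# Hypothetical daily closing prices for a single stock (illustrative values only).
daily_close = {
    date(2021, 1, 4): 101.5,
    date(2021, 1, 5): 102.1,
}


def upsample_to_hourly(daily):
    """Repeat each daily value for every hour of its day (a forward fill)."""
    hourly = {}
    for day, value in sorted(daily.items()):
        start = datetime(day.year, day.month, day.day)
        for hour in range(24):
            hourly[start + timedelta(hours=hour)] = value
    return hourly


hourly_close = upsample_to_hourly(daily_close)
distinct_values = set(hourly_close.values())
print(len(hourly_close), "hourly points,", len(distinct_values), "distinct values")
```

Going from daily to hourly data multiplies the number of points by 24 without adding any information: the number of distinct values stays the same.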

The three services described in this book (Amazon Forecast, Amazon Lookout for Equipment, and Amazon Lookout for Metrics) can deal with univariate data to address various use cases.

Continuous multivariate data

A multivariate time series dataset is a sequence of vectors, where each vector holds the values of several variables emitted at the same time. In this type of dataset, each variable can be considered individually or in the context shaped by the other variables as a whole. This happens when complex relationships govern the way these variables evolve with time (think about several engineering variables linked through physics-based equations).

An industrial asset such as an arc furnace (used in steel manufacturing) runs 24/7 and emits time series data captured by sensors during its entire lifetime. Understanding these continuous time series is critical to prevent any risk of unplanned downtime by performing the appropriate maintenance activities (a domain widely known under the umbrella term of predictive maintenance). Operators of such assets sometimes have to deal with thousands of time series generated at a high frequency (it is not uncommon to collect data with a 10-millisecond sampling period), and each sensor measures a physical quantity. The key reason why each time series should not be considered individually is that each of these physical quantities is usually linked to all the others by more or less complex physical equations.

Take the example of a centrifugal pump: such a pump transforms rotational energy provided by a motor into the displacement of a fluid. While going through such a pump, the fluid gains both additional speed and pressure. According to Euler's pump equation, the head pressure created by the impeller of the centrifugal pump is derived using the following expression:

$$H = \frac{1}{2g}\left[\left(u_2^2 - u_1^2\right) + \left(w_1^2 - w_2^2\right) + \left(c_2^2 - c_1^2\right)\right]$$

In the preceding expression, the following applies:

  • H is the head pressure.
  • u denotes the peripheral circumferential velocity vector.
  • w denotes the relative velocity vector.
  • c denotes the absolute velocity vector.
  • Subscript 1 denotes input variables (also called inlet variables for such a pump). For instance, w1 is the inlet relative velocity.
  • Subscript 2 denotes output variables (also called peripheral variables when dealing with this kind of asset). For instance, w2 is the peripheral relative velocity.
  • g is the gravitational acceleration and is a constant value depending on the latitude where the pump is located.

A multivariate time series dataset describing this centrifugal pump could include u1, u2, w1, w2, c1, c2, and H. All these variables are linked together by the laws of physics that govern this particular asset and cannot be considered individually as univariate time series.
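To make the coupling between these variables concrete, here is a minimal sketch that computes H for each row of a small multivariate dataset. It assumes the common three-term form of Euler's pump equation, H = [(u2² − u1²) + (w1² − w2²) + (c2² − c1²)] / (2g), with velocities in m/s and g fixed at 9.81 m/s². The sensor readings and the function name `euler_head` are hypothetical and serve only as an illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed constant for this sketch)


def euler_head(u1, u2, w1, w2, c1, c2, g=G):
    """Head from the three-term form of Euler's pump equation (velocities in m/s)."""
    return ((u2**2 - u1**2) + (w1**2 - w2**2) + (c2**2 - c1**2)) / (2 * g)


# Hypothetical sensor readings, one dict per timestamp (illustrative values only).
readings = [
    {"u1": 5.0, "u2": 20.0, "w1": 8.0, "w2": 6.0, "c1": 4.0, "c2": 15.0},
    {"u1": 5.0, "u2": 22.0, "w1": 8.5, "w2": 6.5, "c1": 4.0, "c2": 17.0},
]

for row in readings:
    # H is fully determined by the six velocities recorded at the same timestamp.
    head = euler_head(**row)
    row["H"] = head
    print(round(head, 2))
```

Because H is a function of the other six columns, a change in any one of them shows up in H: this is the kind of dependency that is lost when each signal is analyzed as an isolated univariate series.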

If you know when this particular pump is in good shape, has had a maintenance operation, or is running through an abnormal event, you can also have an additional column in your time series capturing this state: your multivariate time series can then be seen as a related dataset that might be useful to try and predict the condition of your pump. You will find more details and examples about labels and related time series data in the Adding context to time series data section.

Amazon Lookout for Equipment, one of the three services described in this book, is able to perform anomaly detection on this type of multivariate dataset.

Event-based multivariate data

There are situations where data is continuously recorded across several operating modes: an aircraft going through different sequences of maneuvers from the pilot, a production line producing successive batches of different products, or rotating equipment (such as a motor or a fan) operating at different speeds depending on the need.

A multivariate time series dataset can be collected across multiple episodes or events, as follows:

  • Each aircraft flight can log a time series dataset from hundreds of sensors and can be matched to a certain sequence of actions executed by the aircraft pilot. Of course, a given aircraft will go through several overlapping maintenance cycles, each flight is different, and the aircraft components themselves go through a natural aging process that can generate additional stress and behavior changes due to the fatigue of going through hundreds of successive flights.
  • A beauty-care plant produces multiple distinct products on the same production line (multiple types and brands of shampoos and shower gels), separated by a clean-in-place process to ensure there is no contamination of a given product by a previous one. Each batch is associated with a different recipe, with different raw materials and different physical characteristics. Although the equipment and process time series are recorded continuously and can be seen as a single stream of variables indexed by time, they can be segmented by the batch they are associated with.

In some cases, a multivariate dataset must be associated with additional context to understand which operating mode a given segment of a time series can be associated with. If the number of situations to consider is reasonably low, a service such as Amazon Lookout for Equipment can be used to perform anomaly detection on this type of dataset.
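The segmentation of a continuous recording by batch described above can be sketched as follows. This is a minimal example using only the standard library; the batch identifiers, the temperature values, and the helper name `segment_by_batch` are hypothetical.

```python
from itertools import groupby

# Hypothetical continuous recording: (timestamp index, batch id, temperature).
records = [
    (0, "shampoo-A", 40.1),
    (1, "shampoo-A", 40.3),
    (2, "cleaning", 75.0),
    (3, "gel-B", 35.2),
    (4, "gel-B", 35.4),
    (5, "gel-B", 35.1),
]


def segment_by_batch(rows):
    """Split a time-ordered recording into consecutive runs sharing a batch id."""
    return [(batch, list(group)) for batch, group in groupby(rows, key=lambda r: r[1])]


segments = segment_by_batch(records)
for batch, rows in segments:
    print(batch, "->", len(rows), "points")
```

Here the batch identifier plays the role of the additional context: without it, the recording is one uninterrupted stream, and with it, each segment can be analyzed against the operating mode it belongs to. Note that `groupby` only groups consecutive rows, so the input must already be ordered by time.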

Multiple time series data

You might also encounter situations where you have multiple time series data that does not form a multivariate time series dataset. These are situations where you have multiple independent signals that can each be seen as a single univariate time series. Although full independence might be debatable depending on your situation, there are no additional insights to be gained by considering potential relationships between the different univariate series.

Here are some examples of such a situation:

  • Closing price for multiple stocks: For any given company, the stock price can be influenced by both exogenous factors (for example, a worldwide pandemic pushing entire countries into shelter-in-place situations) and endogenous decisions (board of directors' decisions; a strategic move from leadership; major innovation delivered by a research and development (R&D) team). Each stock price is not necessarily impacted by other companies' stock prices (competitors, partners, organizations operating on the same market).
  • Sold items for multiple products on an online retail store: Although some products might be related (more summer clothes are sold when temperatures rise again in spring), they do not influence each other; they merely happen to exhibit similar behavior.

Multiple time series are hard to analyze and process as true multivariate time series data because the mechanisms that trigger seemingly linked behaviors come, most of the time, from external factors (summer approaching and having a similar effect on many summer-related items). Modern neural networks are, however, able to train global models on all items at once: this allows them to uncover relationships and context that are not provided in the dataset and to reach a higher level of accuracy than traditional statistical models, which are local (univariate-focused) by nature.
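As a rough illustration of what "local" means here, the sketch below stores several independent series in a long format (one row per item and time step) and fits one model per item, each seeing only its own history. A naive last-value forecast stands in for a traditional statistical model; the item names, the values, and the helper names are hypothetical, and this is not a description of how any AWS service works internally.

```python
from collections import defaultdict

# Hypothetical long-format rows: (item_id, day index, units sold).
sales = [
    ("sandals", 0, 10), ("sandals", 1, 12), ("sandals", 2, 15),
    ("sunscreen", 0, 3), ("sunscreen", 1, 4), ("sunscreen", 2, 6),
]


def split_by_item(rows):
    """Turn long-format rows into one time-ordered list of values per item."""
    series = defaultdict(list)
    for item_id, _day, units in sorted(rows, key=lambda r: (r[0], r[1])):
        series[item_id].append(units)
    return dict(series)


def naive_forecast(values):
    """Local model: predict that the next value equals the last observed one."""
    return values[-1]


series_by_item = split_by_item(sales)
# One independent forecast per item: no information is shared across series.
forecasts = {item: naive_forecast(values) for item, values in series_by_item.items()}
print(forecasts)
```

A global model would instead be trained once on all the rows of `sales`, which is what lets it pick up patterns shared across items (such as both products rising as summer approaches).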

We will see later on that Amazon Forecast (for forecasting with local and global models) and Amazon Lookout for Metrics (for anomaly detection on univariate business metrics) are good examples of services provided by Amazon that can deal with this type of dataset.