Hands-On Time Series Analysis with R

By: Rami Krispin

Overview of this book

Time-series analysis is the art of extracting meaningful insights from, and revealing patterns in, time-series data using statistical and data visualization approaches. These insights and patterns can then be utilized to explore past events and forecast future values in the series. This book explores the basics of time-series analysis with R and lays the foundation you need to build forecasting models. You will learn how to preprocess raw time-series data and clean and manipulate data with packages such as stats, lubridate, xts, and zoo. You will analyze data using both descriptive statistics and rich data visualization tools in R, including the TSstudio, plotly, and ggplot2 packages. The book then delves into traditional forecasting models such as time-series linear regression, exponential smoothing (Holt, Holt-Winters, and more), and Auto-Regressive Integrated Moving Average (ARIMA) models with the stats and forecast packages. You'll also work on advanced time-series regression models with machine learning algorithms such as random forest and gradient boosting machine using the h2o package. By the end of this book, you will have developed the skills necessary for exploring your data, identifying patterns, and building a forecasting model using various traditional and machine learning methods.

A brief introduction to R

Throughout the learning process in this book, we will use R intensively to introduce methods, techniques, and approaches for time series analysis. If you have never used R before, this section provides a brief introduction, which includes the basic foundations of R, the operators, the packages, different data structures, and loading data. This won't make you an R expert, but it will provide you with the basic R skills you will require to start the learning journey of this book.

R operators

As in any other programming language, operators are one of the main elements of programming in R. They are a collection of functions, each represented by one or more symbols, and can be categorized into four groups, as follows:

  • Assignment operators
  • Arithmetic operators
  • Logical operators
  • Relational operators

Assignment operators

Assignment operators are probably the family of operators that you will use the most while working with R. As the name of this group implies, they are used to assign objects such as numeric values, strings, vectors, models, and plots to a name (variable). This includes operators such as the back arrow (<-) or the equals sign (=):

# Assigning values to new variables
str <- "Hello World!" # String
int <- 10 # Numeric (stored as a double; use 10L for an integer)
vec <- c(1,2,3,4) # Vector

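We can use the print function to view the values of the objects. Note that combining a string and a number with c() coerces both to character, which is why 10 is printed in quotes: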

print(c(str, int))
## [1] "Hello World!" "10"

This is one more example of the print function:

print(vec)
## [1] 1 2 3 4

While both operators can be used to assign values to a variable, the = symbol is rarely used for assignment outside of function calls, where it binds values to arguments (the reasons are beyond the scope of this book; more information is available in the documentation of the assignment operators, which you can open with ?assignOps).
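One practical difference between the two operators can be illustrated with a minimal sketch. The check_assignment helper below is a hypothetical name used only for illustration: inside a function call, = binds a value to an argument name, whereas <- creates a variable in the calling environment before passing the value on.

```r
# Hypothetical helper for illustration: compares = and <- inside a function call
check_assignment <- function() {
  # Here x is only the name of an argument of mean(); no variable is created
  m1 <- mean(x = c(2, 4, 6))
  after_equals <- exists("x", envir = environment(), inherits = FALSE)

  # Here x is assigned in this function's environment and then passed to mean()
  m2 <- mean(x <- c(2, 4, 6))
  after_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow, same_mean = m1 == m2)
}

res <- check_assignment()
res
```

In this sketch, after_equals should be FALSE and after_arrow should be TRUE, while both calls return the same mean.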

Arithmetic operators

This family of operators includes basic arithmetic operations, such as addition, division, exponentiation, and remainder. As the following examples show, it is straightforward to apply these operators. We will start by assigning the values 10 and 2 to the x and y variables, respectively:

x <- 10
y <- 2

The following code shows the usage of the addition operator:

# Addition
x + y
## [1] 12

The following code shows the usage of the division operator:

# Division
x / 2.5
## [1] 4

The following code shows the usage of the exponentiation operator:

# Exponentiation
y ^ 3
## [1] 8
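The remainder operator mentioned at the start of this section is not shown in the examples above. As a short sketch, using the same x and y values, the %% operator returns the remainder of a division and the related %/% operator returns the integer part of the quotient:

```r
x <- 10
y <- 2

# Remainder (modulo): what is left after dividing 10 by 3
x %% 3
## [1] 1

# Integer division: how many whole times 3 fits into 10
x %/% 3
## [1] 3

# The two are consistent: 3 * (x %/% 3) + (x %% 3) recovers x
3 * (x %/% 3) + (x %% 3) == x
## [1] TRUE
```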

Now, let's look at the logical operators.

Logical operators

Logical operators in R can be applied to numeric or complex vectors or Boolean objects, that is, TRUE or FALSE. When a number is used in a logical operation, zero is treated as FALSE and any non-zero value is treated as TRUE. It is common to use those operators to test single or multiple conditions under the if…else statement:

# TRUE and FALSE are reserved words in R for Boolean objects.
# T and F are predefined shortcuts, but they are ordinary variables that can be overwritten
a <- TRUE
b <- FALSE

# We can also test if a Boolean object is TRUE or FALSE
isTRUE(a)
## [1] TRUE
isTRUE(b)
## [1] FALSE

The following code shows the usage of the AND operator:

# The AND operator
a & b
## [1] FALSE

The following code shows the usage of the OR operator:

# The OR operator
a | b
## [1] TRUE

The following code shows the usage of the NOT operator:

# The NOT operator
!a
## [1] FALSE
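The examples above use single Boolean values. As a brief sketch of how the same operators behave on other inputs (the v1 and v2 vectors are illustrative names, not part of the original example), numeric values are coerced to logical, and & and | work element by element on vectors:

```r
# Numeric values are coerced to logical: zero is FALSE, any non-zero value is TRUE
as.logical(c(0, 0.5, 1, 2, -3))
## [1] FALSE  TRUE  TRUE  TRUE  TRUE

# & and | work element by element on vectors
v1 <- c(TRUE, TRUE, FALSE)
v2 <- c(TRUE, FALSE, FALSE)
v1 & v2
## [1]  TRUE FALSE FALSE
v1 | v2
## [1]  TRUE  TRUE FALSE
```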

We can see the applications of those operators by using an if...else statement:

# The AND operator will return TRUE only if both a and b are TRUE
if (a & b) {
  print("a AND b is true")
} else {
  print("a AND b is false")
}
## [1] "a AND b is false"

The following code shows an example of the OR operator, along with the if...else statement:

# The OR operator will return FALSE only if both a and b are FALSE
if (a | b) {
  print("a OR b is true")
} else {
  print("a OR b is false")
}
## [1] "a OR b is true"

Likewise, we can check whether the Boolean object is TRUE or FALSE with the isTRUE function:

isTRUE(a)
## [1] TRUE

Here, the condition is FALSE:

isTRUE(b)
## [1] FALSE
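When a condition is built from single values, as in an if statement, R also offers the && and || forms of these operators, which evaluate from left to right and stop as soon as the result is known. The following is a minimal sketch, with v standing in for a hypothetical input that may be missing:

```r
v <- NULL  # hypothetical input that may be missing

# && stops at the first FALSE, so v > 0 is never evaluated
# and the result is a single FALSE
safe <- !is.null(v) && v > 0

# & evaluates both sides; NULL > 0 has length zero, and so does the result
elementwise <- !is.null(v) & v > 0

safe
## [1] FALSE
length(elementwise)
## [1] 0
```

A zero-length result such as elementwise cannot be used as the condition of an if statement, which is one reason the && form is often preferred there.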

Now, let's look at relational operators.

Relational operators

These operators allow for the comparison of objects, such as numeric objects and symbols. Similar to logical operators, relational operators are mainly utilized in conditional statements such as if…else and while:

# Assign the value 5 to the variables a and b, and 7 to c
a <- b <- 5
c <- 7

The following code shows the use of the if…else statement, along with the output:

if (a == b) {
  print("a is equal to b")
} else {
  print("a is not equal to b")
}
## [1] "a is equal to b"

Alternatively, you can use the ifelse function when you want to assign a value in an if…else structure:

d <- ifelse(test = a >= c,
            yes = "a is greater than or equal to c",
            no = "a is smaller than c")

Here, the ifelse function has three arguments:

  • test: Evaluates a logical test
  • yes: Defines what should be the output if the test result is TRUE
  • no: Defines what should be the output if the test result is FALSE

Let's print the value of the d variable to check the output:

print(d)
## [1] "a is smaller than c"
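Unlike the if…else statement, the ifelse function is vectorized, so the test can be applied to every element of a vector at once. A short sketch, using a hypothetical values vector and labels chosen for illustration:

```r
values <- c(3, 7, 10)  # hypothetical example values

# The test is evaluated for each element, returning one label per element
labels <- ifelse(test = values >= 7, yes = "high", no = "low")
labels
## [1] "low"  "high" "high"
```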

As core functions of R, the operators are defined in the base package (one of R's inherent packages), where each group of operators is documented under a designated topic. More information about the operators is available in the function documentation, which you can access with the help function (? or help()):

# Each function in a package should have documentation
# To access the function documentation, use ? or help()
?assignOps
?Arithmetic
?Logic
?Comparison

Now, let's look at the R package.

The R package

The bare version of R (without any additional installed packages) comes with seven core packages that contain the built-in applications and functions of the software. These include applications for statistics, visualization, and data processing, as well as a variety of datasets. Unlike other packages, the core packages are inherent in R, and therefore they load automatically. Although the core packages provide many applications, the vast majority of R applications are based on add-on packages that are stored on CRAN or in GitHub repositories.

As of May 2019, there are more than 13,500 packages with applications for statistical modeling, data wrangling, and data visualization for a variety of domains (statistics, economics, finance, astronomy, and so on). A typical package may contain a collection of R functions, as well as compiled code (utilizing other languages, such as C, Java, and FORTRAN). Moreover, some packages include datasets that, in most cases, are related to the package's main application. For example, the forecast package comes with a time series dataset, which is used to demonstrate the forecasting models that are available in the package.

Installation and maintenance of a package

There are a few methods that you can use to install a package, the most common of which is the install.packages function:

# Installing the forecast package:
install.packages("forecast")

You can use this function to install more than one package at once by using a vector type of input:

install.packages(c("TSstudio", "xts", "zoo"))

Most packages are updated frequently with new features, improvements, and bug fixes. R provides functions for checking and updating your installed packages. For example, the packageVersion function returns the version details of the input package:

packageVersion("forecast")
## [1] '8.5'

The old.packages function identifies whether updates are available for any of the installed packages, and the update.packages function is used to update all of the installed packages automatically. You can update a specific package using the install.packages function, with the package name as input. For instance, if we wish to update the lubridate package, we can use the following code:

install.packages("lubridate")
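Before installing or updating, it can be useful to check whether a package is already installed and which version you have. The following is a minimal sketch that works offline; the is_installed helper and the notARealPackage123 name are hypothetical and used only for illustration, and stats is used because it ships with R:

```r
# Hypothetical helper: TRUE if a package is found in the current library paths
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

is_installed("stats")               # stats ships with R
is_installed("notARealPackage123")  # assumed not to exist

# The result of packageVersion() can be compared against a version string
packageVersion("stats") >= "3.0.0"
```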

Last but not least, removing a package can be done with the remove.packages function:

remove.packages("forecast")

When updating or removing an installed package with the install.packages or remove.packages functions, make sure that the package is not loaded into the working environment. The following section explains how to check whether a package has been loaded.

Loading a package in the R working environment

The R working environment defines the working space where the loaded functions, objects, and data are kept and are available for use. By default, when you open R, the global environment and the built-in packages of R are loaded.

An installed package becomes available for use in the R session once it is loaded. The search function lists the environments and packages that are currently attached. For example, if we execute the search function right after opening R, this is the output you can expect to see:

search() 
## [1] ".GlobalEnv"        "package:stats"     "package:graphics"
## [4] "package:grDevices" "package:utils"     "package:datasets"
## [7] "package:methods"   "Autoloads"         "package:base"

As you can see from the preceding output, only the seven core packages of R are currently loaded. Loading a package into the environment can be done with either the library or the require function. While both of these functions load an installed package and its attached functions, the require function is usually used within a function, as it returns FALSE upon failure (whereas the library function throws an error). Let's load the TSstudio package and see the change in the environment:

library(TSstudio)

Now, we will check the search path again and review the changes:

search()

We get the following output:

##  [1] ".GlobalEnv"        "package:TSstudio"  "package:stats"
##  [4] "package:graphics"  "package:grDevices" "package:utils"
##  [7] "package:datasets"  "package:methods"   "Autoloads"
## [10] "package:base"

Similarly, you can unload a package from the environment by using the detach function:

detach("package:TSstudio", unload=TRUE) 

Let's check the working environment after detaching the package:

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics"
## [4] "package:grDevices" "package:utils"     "package:datasets"
## [7] "package:methods"   "Autoloads"         "package:base"
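The difference between the library and require functions on failure, described earlier in this section, can be sketched as follows. The package name notARealPackage123 is hypothetical and assumed not to be installed:

```r
missing_pkg <- "notARealPackage123"  # hypothetical name, assumed not installed

# require() returns FALSE (with a warning) when the package cannot be loaded
require_result <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error instead, which we catch here
library_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

require_result
library_result
```

In this sketch, require_result should be FALSE and library_result should be "error", which is why require is convenient when a function needs to react to a missing package rather than stop.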

The key packages

Here is a short list of the key packages that we will use throughout this book by topic:

  • Data preparation and utility functions. These include the following:
    • stats: One of the base packages of R, this provides a set of statistical tools, including applications for time series, such as time series objects (ts) and the window function.
    • zoo and xts: With applications for data manipulation, aggregation, and visualization, these packages are some of the main tools that you use to handle time series data in an efficient manner.
    • lubridate: This provides a set of tools for handling a variety of date objects and time formats.
    • dplyr: This is one of the main packages in R for data manipulation. This provides a powerful tool for data transformation and aggregation.
  • Data visualization and descriptive analysis. These include the following:
    • TSstudio: This package focuses on both descriptive and predictive analysis of time series data. It provides a set of interactive data visualization tools, utility functions, and training methods for forecasting models. In addition, the package contains all the datasets that are used throughout this book.
    • ggplot2 and plotly: Packages for data visualization applications.
  • Predictive analysis, statistical modeling, and forecasting. These include the following:
    • forecast: This is one of the main packages for time series analysis in R and has a variety of applications for analyzing and forecasting time series data. This includes statistical models such as ARIMA, exponential smoothing, and neural network time series models, as well as automation tools.
    • h2o: This is one of the main packages in R for machine learning modeling. It provides machine learning algorithms such as random forest, gradient boosting machine, deep learning, and so on.

Variables

Variables in R have a broader definition and more capabilities than in most typical programming languages. Without the need to declare the type or the attribute, any R object can be assigned to a variable. This includes objects such as numbers, strings, vectors, tables, plots, functions, and models. The main features of these variables are as follows:

  • Flexibility: Any R object can be assigned to a variable, without any pre-step (such as declaring the variable type). Furthermore, when assigning the object to a new variable, all the attributes of the object transfer, along with its content, to the new variable.
  • Attribute: Neither the variable nor its attributes need to be defined prior to the assignment of the object. The object attribute passes to the variable upon assignment (this simplicity is one of the strengths of R). For example, we will assign the Hello World! string to the a variable:
a <- "Hello World!"

Let's look at the attributes of the a variable:

class(a)

We get the following output:

## [1] "character"

Now, let's assign the a variable to the b variable and check out the characteristics of the new variable:

b <- a
b

We get the following output:

## [1] "Hello World!"

Now, let's check the characteristics of the new variable:

class(b)

We get the following output:

## [1] "character"

As you can see, the b variable inherited both the value and attribute of the a variable.

  • Name: A valid variable name can consist of letters, numbers, and the dot or underscore characters. However, it must start with either a letter or a dot that is not followed by a number (that is, var_1, var.1, var1, and .var1 are examples of valid names, while 1var and .1var are examples of invalid names). In addition, R reserves a set of words for its key operations, such as if, TRUE, and FALSE, and these cannot be used as variable names. Last but not least, variable names are case-sensitive. For example, Var_1 and var_1 refer to two different variables.
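The naming rules above can be checked programmatically. The following is a minimal sketch that relies on the base make.names function, which returns its input unchanged only when the input is already a syntactically valid name; the candidates and is_valid names are illustrative:

```r
candidates <- c("var_1", "var.1", "var1", ".var1", "1var", ".1var", "if", "TRUE")

# A candidate is valid when make.names() does not need to modify it
is_valid <- make.names(candidates) == candidates

# Label each result with the name that was tested
setNames(is_valid, candidates)
```

The first four candidates should be reported as valid, while 1var, .1var, and the reserved words if and TRUE should not.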

Now that we have discussed operators, packages, and variables, it is time to jump into the water and start working with real data!

Importing and loading data to R

Importing or loading data is one of the key elements of the workflow in any analysis. R provides a variety of methods that you can use to import or load data into the environment, and it supports multiple types of data formats. This includes importing data from flat files (for example, CSV and TXT), web APIs, or databases (SQL Server, Teradata, Oracle, and so on), and loading datasets from R packages. Here, we will focus on the main methods that we will use in this book, that is, importing data from flat files or a web API and loading data from an R package.

Flat files

Almost any common data format can be imported directly into R, from CSV and Excel formats to SPSS, SAS, and Stata files. RStudio has a built-in option that you can use to import datasets either from the environment quadrant or the main menu (File | Import Dataset). Files can be imported from your hard drive, the web, or other sources. In the following example, we will use the read.csv function to import a CSV file with information about the US monthly total vehicle sales from GitHub:

  1. First, let's assign the URL address to a variable:
file_url <- "https://raw.githubusercontent.com/PacktPublishing/Hands-On-Time-Series-Analysis-with-R/master/Chapter%201/TOTALNSA.csv"
  2. Next, we will use the read.csv function to read the file and assign it to an object named df1:
df1 <- read.csv(file = file_url, stringsAsFactors = FALSE)
  3. We can use class and str to review the characteristics of the object:
class(df1)
## [1] "data.frame"
  4. The following code block shows the output of the str function:
str(df1)
## 'data.frame':    504 obs. of  2 variables:
## $ Date : chr "1/31/1976" "2/29/1976" "3/31/1976" "4/30/1976" ...
## $ Value: num 885 995 1244 1191 1203 ...

The file path is stored in a variable for convenience. Alternatively, you can use the full path directly within the read.csv function. When set to TRUE, the stringsAsFactors option transforms strings into a categorical variable (factor); setting it to FALSE prevents this (FALSE has been the default since R 4.0.0). The content of the CSV file is stored in a data frame object named df1 (df is a common abbreviation for data frame). The str function provides an overview of the key characteristics of the data frame. This includes the number of observations and variables, the class of each variable, and its first few observations.

Be aware that some of your data attributes may get lost or change during the import process, mainly when working with non-numeric objects such as dates and mixed objects (numeric and characters). It is highly recommended to check the data attributes once the data is imported into the environment and to reformat the data if needed. For example, a common change in the data attributes occurs when importing date or time objects from a TXT or CSV file, as happened with the Date variable of df1, which was imported as a character. Since those objects are key elements of time series data, Chapter 2, Working with Date and Time Objects, focuses on how to handle the loss of the attributes of date and time objects when importing from external sources.
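As a preview of that reformatting step, here is a minimal sketch that does not require an internet connection. It assumes a small inline sample in the same layout as the CSV file (only the first four rows, with values taken from the series shown in this section) and uses the base as.Date function; the csv_text and df_sample names are illustrative:

```r
# Inline sample in the same layout as the CSV file (first four rows only)
csv_text <- "Date,Value
1/31/1976,885.2
2/29/1976,994.7
3/31/1976,1243.6
4/30/1976,1191.2"

df_sample <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df_sample$Date)
## [1] "character"

# Reformat the character column as a Date object (month/day/four-digit year)
df_sample$Date <- as.Date(df_sample$Date, format = "%m/%d/%Y")
class(df_sample$Date)
## [1] "Date"
df_sample$Date[1]
## [1] "1976-01-31"
```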

Web API

Since the ability to collect and store data has improved significantly in recent years, the use of web APIs has become more popular. They open access to an enormous amount of data that is stored in a variety of databases, such as the Federal Reserve Economic Data (FRED), the Bureau of Labor Statistics, the World Bank, and Google Trends. In the following example, we will import the US total monthly vehicle sales dataset (https://fred.stlouisfed.org/series/TOTALNSA) again, this time using the Quandl API to source the data from FRED:

library(Quandl) 

df2 <- Quandl(code = "FRED/TOTALNSA",
type = "raw",
collapse = "monthly",
order = "asc",
end_date = "2017-12-31")

Source: U.S. Bureau of Economic Analysis, Total Vehicle Sales [TOTALNSA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/TOTALNSA, May 19, 2019.

The main arguments of the Quandl function are as follows:

  • code: This defines the source and name of the series. In this case, the source is FRED and the name of the series is TOTALNSA.
  • type: This is the data structure of the returned series. This could be either a raw (data frame), ts, zoo, xts, or timeSeries object.
  • collapse: This sets the aggregation level of the series frequency. For example, if the raw series has a monthly frequency, you can aggregate the series to a quarterly or annual frequency.
  • order: This defines whether the series should be arranged in ascending or descending order.
  • end_date: This sets the ending date of the series.

Now, let's review the key characteristics of the new data frame:

class(df2)  
## [1] "data.frame"

This is the output when we use str(df2):

str(df2)
## 'data.frame':    504 obs. of  2 variables:
## $ Date : Date, format: "1976-01-31" "1976-02-29" ...
## $ Value: num 885 995 1244 1191 1203 ...
## - attr(*, "freq")= chr "monthly"

The Quandl function is more flexible than the read.csv function we used in the previous example. It allows the user to control the data format, preserve its attributes, and customize the level of aggregation, which saves time. You can see that the structure of the df2 data frame is fairly similar to that of the df1 data frame: a data frame with two variables and 504 observations. However, we were able to preserve the attribute of the Date variable (as opposed to the df1 data frame, where the Date variable was transformed into character format).

R datasets

An R package may contain, in addition to code and functions, datasets in any of R's designated formats (data frame, time series, matrix, and so on). In most cases, a dataset is either related to the package's functionality or included for educational purposes. For example, the TSstudio package stores most of the time series datasets that are used in this book. In the following example, we will load the US total monthly vehicle sales again, this time using the TSstudio package:

# If the package is not installed on your machine:
install.packages("TSstudio")
# Loading the series from the package
data("USVSales", package = "TSstudio")

The class(USVSales) function gives us the following output:

class(USVSales)
## [1] "ts"

The head(USVSales) function gives us the following output:

head(USVSales)
## [1] 885.2 994.7 1243.6 1191.2 1203.2 1254.7

Note that the data function does not assign the object to a variable. Rather, it loads it into the environment from the package. The main advantage of storing a dataset in a package format is that there is no loss of attributes (that is, there is no difference between the original object and the loaded object).

We used the data function to load the USVSales dataset of the TSstudio package. Alternatively, if you wish to assign the dataset to a variable, you can do either of the following:

  • Load the data into the working environment and then assign the loaded object to a new variable.
  • Assign directly from the package to a variable by using the :: operator. The :: operator allows you to access objects from a package (for example, functions and datasets) without loading the package into the working environment. For example, we can assign the USVSales series directly from the TSstudio package with the :: operator:
US_V_Sales <- TSstudio::USVSales

Note that the USVSales dataset series that we loaded from the TSstudio package is a time series (ts) object, that is, a built-in R time series class. In Chapter 3, The Time Series Object, we will discuss the ts class and its usage in more detail.
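The same approach works with any package that ships datasets. As a sketch that does not depend on TSstudio being installed, the following example assumes the AirPassengers series from the datasets package (one of R's core packages) and shows that the attributes of the ts object, such as its frequency and starting point, are preserved when it is assigned with the :: operator; the air variable name is illustrative:

```r
# AirPassengers is a monthly ts object that ships with R's datasets package
air <- datasets::AirPassengers

class(air)
## [1] "ts"
frequency(air)  # number of observations per year
## [1] 12
start(air)      # first period of the series (year, month)
## [1] 1949    1
```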