There are several basic data types in R for handling different types of data and values:

`numeric`
: The `numeric` data type is used to store real or decimal vectors and is identical to the `double` data type.

`double`
: This data type can store and represent double precision vectors.

`integer`
: This data type is used for representing 32-bit integer vectors.

`character`
: This data type is used to represent character vectors, where each element can be a string of type character.

`logical`
: The reserved words `TRUE` and `FALSE` are logical constants in the R language, and `T` and `F` are global variables. All four are logical type vectors.

`complex`
: This data type is used to store and represent complex numbers.

`factor`
: This type is used to represent nominal or categorical variables by storing the nominal values in a vector of integers ranging from *1…n*, such that *n* is the number of distinct values for the variable. A vector of character strings for the actual variable values is then mapped to this vector of integers.

Miscellaneous
: There are several other types, including `NA` to denote missing values in data, `NaN`, which denotes *not a number*, and *ordered*, which is used for factoring ordinal variables.

Common functions for each data type include `as` and `is`, which are used for converting between data types (typecasting) and checking the data type, respectively. For example, `as.numeric(…)` would typecast the data or vector indicated by the ellipses into the numeric type, and `is.numeric(…)` would check whether the data is of the numeric type.
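As a small additional sketch (our own example, not part of the upcoming snippet), typecasting also works on character data, and values that cannot be parsed are coerced to `NA` with a warning:

```r
# character to numeric works when the string looks like a number
as.numeric("3.14")
# [1] 3.14

# values that cannot be parsed become NA (R raises a warning)
suppressWarnings(as.numeric("hello"))
# [1] NA

# the is.* family lets us check a type before converting
is.character("3.14")
# [1] TRUE
```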

Let us look at a few more examples for the various data types in the following code snippet to understand them better:

    # typecasting and checking data types
    > n <- c(3.5, 0.0, 1.7, 0.0)
    > typeof(n)
    [1] "double"
    > is.numeric(n)
    [1] TRUE
    > is.double(n)
    [1] TRUE
    > is.integer(n)
    [1] FALSE
    > as.integer(n)
    [1] 3 0 1 0
    > as.logical(n)
    [1]  TRUE FALSE  TRUE FALSE

    # complex numbers
    > comp <- 3 + 4i
    > typeof(comp)
    [1] "complex"

    # factoring nominal variables
    > size <- c(rep('large', 5), rep('small', 5), rep('medium', 3))
    > size
     [1] "large"  "large"  "large"  "large"  "large"  "small"  "small"  "small"  "small"  "small"
    [11] "medium" "medium" "medium"
    > size <- factor(size)
    > size
     [1] large  large  large  large  large  small  small  small  small  small
    [11] medium medium medium
    Levels: large medium small
    > summary(size)
     large medium  small
         5      3      5

The preceding examples should make the concepts clearer. Notice that non-zero numeric values always typecast to logical `TRUE`, and zero values to `FALSE`, as we can see from typecasting our numeric vector to logical. We will now dive into the various data structures in R.

The base R system has several important core data structures which are extensively used in handling, processing, manipulating, and analyzing data. We will be talking about five important data structures, which can be classified according to the type of data which can be stored and its dimensionality. The classification is depicted in the following table:

Content type | Dimensionality | Data structure
---|---|---
Homogeneous | One-dimensional | Vector
Homogeneous | N-dimensional | Array
Homogeneous | Two-dimensional | Matrix
Heterogeneous | One-dimensional | List
Heterogeneous | Two-dimensional | DataFrame

The content type in the table depicts whether the data stored in the structure belongs to the same data type (homogeneous) or can contain data of different data types (heterogeneous). The dimensionality of the data structure is straightforward and self-explanatory. We will now examine each data structure in further detail.

The vector is the most basic data structure in R, and here vectors indicate atomic vectors. They can be used to represent any data in R, including input and output data. Vectors are usually created using the `c(…)` function, which is short for combine. Vectors can also be created in other ways, such as using the `:` operator or the `seq(…)` family of functions. Vectors are homogeneous in that all elements always belong to a single data type, and the vector by itself is a one-dimensional structure. The following snippet shows some vector representations:

    > 1:5
    [1] 1 2 3 4 5
    > c(1,2,3,4,5)
    [1] 1 2 3 4 5
    > seq(1,5)
    [1] 1 2 3 4 5
    > seq_len(5)
    [1] 1 2 3 4 5

You can also assign vectors to variables and perform different operations on them, including data manipulation, mathematical operations, transformations, and so on. We depict a few such examples in the following snippet:

    # assigning two vectors to variables
    > x <- 1:5
    > y <- c(6,7,8,9,10)
    > x
    [1] 1 2 3 4 5
    > y
    [1]  6  7  8  9 10

    # operating on vectors
    > x + y
    [1]  7  9 11 13 15
    > sum(x)
    [1] 15
    > mean(x)
    [1] 3
    > x * y
    [1]  6 14 24 36 50
    > sqrt(x)
    [1] 1.000000 1.414214 1.732051 2.000000 2.236068

    # indexing and slicing
    > y[2:4]
    [1] 7 8 9
    > y[c(2,3,4)]
    [1] 7 8 9

    # naming vector elements
    > names(x) <- c("one", "two", "three", "four", "five")
    > x
      one   two three  four  five
        1     2     3     4     5

The preceding snippet should give you a good flavor of what we can do with vectors. Try playing around with vectors and transforming and manipulating data with them!

From the table of data structures we mentioned earlier, arrays can store homogeneous data and are N-dimensional data structures, unlike vectors. Matrices are a special case of arrays with two dimensions, but more on that later. It is difficult to represent data with more than two dimensions on screen, but R can still handle such arrays in a special way. The following example creates a *2x2x3* three-dimensional array:

    # create a three-dimensional array
    three.dim.array <- array(
        1:12,    # input data
        dim = c(2, 2, 3),    # dimensions
        dimnames = list(    # names of dimensions
            c("row1", "row2"),
            c("col1", "col2"),
            c("first.set", "second.set", "third.set")
        )
    )

    # view the array
    > three.dim.array
    , , first.set

         col1 col2
    row1    1    3
    row2    2    4

    , , second.set

         col1 col2
    row1    5    7
    row2    6    8

    , , third.set

         col1 col2
    row1    9   11
    row2   10   12

From the preceding output, you can see that R filled in the data in column-first order in the three-dimensional array. We will now look at matrices in the following section.

We've briefly mentioned that matrices are a special case of arrays with two dimensions, which are represented by rows and columns. Just as we used the `array(…)` function in the previous section to create an array, we will use the `matrix(…)` function to create matrices.

The following snippet creates a *4x3* matrix:

    # create a matrix
    mat <- matrix(
        1:12,    # data
        nrow = 4,    # num of rows
        ncol = 3,    # num of columns
        byrow = TRUE    # fill the elements row-wise
    )

    # view the matrix
    > mat
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9
    [4,]   10   11   12

Thus you can see from the preceding output that we have a *4x3* matrix with `4` rows and `3` columns, and we filled in the data in row-wise fashion by setting the `byrow` parameter of the `matrix(…)` function to `TRUE`.

The following snippet shows some mathematical operations with matrices which you should be familiar with:

    # initialize matrices
    m1 <- matrix(
        1:9,    # data
        nrow = 3,    # num of rows
        ncol = 3,    # num of columns
        byrow = TRUE    # fill the elements row-wise
    )
    m2 <- matrix(
        10:18,    # data
        nrow = 3,    # num of rows
        ncol = 3,    # num of columns
        byrow = TRUE    # fill the elements row-wise
    )

    # matrix addition
    > m1 + m2
         [,1] [,2] [,3]
    [1,]   11   13   15
    [2,]   17   19   21
    [3,]   23   25   27

    # matrix transpose
    > t(m1)
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9

    # matrix product
    > m1 %*% m2
         [,1] [,2] [,3]
    [1,]   84   90   96
    [2,]  201  216  231
    [3,]  318  342  366

We encourage you to try out more complex operations using matrices. See if you can find the inverse of a matrix.
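As a hint for the exercise, a minimal sketch: base R's `solve(…)` function returns the inverse of a square, non-singular matrix. The matrix below is our own example (note that a matrix filled with `1:9` is singular and has no inverse):

```r
# invert a 2x2 non-singular matrix with solve()
m <- matrix(c(2, 0, 0, 4), nrow = 2, ncol = 2)
m.inv <- solve(m)
m.inv
#      [,1] [,2]
# [1,]  0.5 0.00
# [2,]  0.0 0.25

# multiplying a matrix by its inverse gives the identity matrix
round(m %*% m.inv)
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
```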

Lists are a special type of vector, distinct from the atomic vectors we discussed earlier. The difference is that lists are heterogeneous and can hold different types of data, such that each element of a list can itself be a list, an atomic vector, an array, a matrix, or even a function. The following snippet shows us how to create lists:

    # create sample list
    list.sample <- list(
        nums = seq.int(1,5),
        languages = c("R", "Python", "Julia", "Java"),
        sin.func = sin
    )

    # view the list
    > list.sample
    $nums
    [1] 1 2 3 4 5

    $languages
    [1] "R"      "Python" "Julia"  "Java"

    $sin.func
    function (x)  .Primitive("sin")

    # accessing individual list elements
    > list.sample$languages
    [1] "R"      "Python" "Julia"  "Java"
    > list.sample$sin.func(1.5708)
    [1] 1

You can see from the preceding snippet that lists can hold different types of elements and accessing them is really easy.

Let us look at a few more operations with lists in the following snippet:

    # initializing two lists
    l1 <- list(nums = 1:5)
    l2 <- list(
        languages = c("R", "Python", "Julia"),
        months = c("Jan", "Feb", "Mar")
    )

    # check lists and their type
    > l1
    $nums
    [1] 1 2 3 4 5

    > typeof(l1)
    [1] "list"
    > l2
    $languages
    [1] "R"      "Python" "Julia"

    $months
    [1] "Jan" "Feb" "Mar"

    > typeof(l2)
    [1] "list"

    # concatenating lists
    > l3 <- c(l1, l2)
    > l3
    $nums
    [1] 1 2 3 4 5

    $languages
    [1] "R"      "Python" "Julia"

    $months
    [1] "Jan" "Feb" "Mar"

    # converting list back to a vector
    > v1 <- unlist(l1)
    > v1
    nums1 nums2 nums3 nums4 nums5
        1     2     3     4     5
    > typeof(v1)
    [1] "integer"

Now that we know how lists work, we will be moving on to the last and perhaps most widely used data structure in data processing and analysis, the DataFrame.

The DataFrame is a special data structure used to handle heterogeneous data in two dimensions. This structure is used to handle data tables or tabular data with several observations, samples, or data points, which are represented by rows, and attributes for each sample, which are represented by columns. Each column can be thought of as a dimension of the dataset or a vector. It is very popular since it maps easily to tabular data, notably spreadsheets.

The following snippet shows us how we can create DataFrames and examine their properties:

    # create data frame
    df <- data.frame(
        name = c("Wade", "Steve", "Slade", "Bruce"),
        age = c(28, 85, 55, 45),
        job = c("IT", "HR", "HR", "CS")
    )

    # view the data frame
    > df
       name age job
    1  Wade  28  IT
    2 Steve  85  HR
    3 Slade  55  HR
    4 Bruce  45  CS

    # examine data frame properties
    > class(df)
    [1] "data.frame"
    > str(df)
    'data.frame':   4 obs. of  3 variables:
     $ name: Factor w/ 4 levels "Bruce","Slade",..: 4 3 2 1
     $ age : num  28 85 55 45
     $ job : Factor w/ 3 levels "CS","HR","IT": 3 2 2 1
    > rownames(df)
    [1] "1" "2" "3" "4"
    > colnames(df)
    [1] "name" "age"  "job"
    > dim(df)
    [1] 4 3

You can see from the preceding snippet how DataFrames can represent tabular data, where each attribute is a dimension or column. You can also perform multiple operations on DataFrames, such as merging, concatenating, binding, subsetting, and so on. We will depict some of these operations in the following snippet:

    # initialize two data frames
    emp.details <- data.frame(
        empid = c('e001', 'e002', 'e003', 'e004'),
        name = c("Wade", "Steve", "Slade", "Bruce"),
        age = c(28, 85, 55, 45)
    )
    job.details <- data.frame(
        empid = c('e001', 'e002', 'e003', 'e004'),
        job = c("IT", "HR", "HR", "CS")
    )

    # view data frames
    > emp.details
      empid  name age
    1  e001  Wade  28
    2  e002 Steve  85
    3  e003 Slade  55
    4  e004 Bruce  45
    > job.details
      empid job
    1  e001  IT
    2  e002  HR
    3  e003  HR
    4  e004  CS

    # binding and merging data frames
    > cbind(emp.details, job.details)
      empid  name age empid job
    1  e001  Wade  28  e001  IT
    2  e002 Steve  85  e002  HR
    3  e003 Slade  55  e003  HR
    4  e004 Bruce  45  e004  CS
    > merge(emp.details, job.details, by='empid')
      empid  name age job
    1  e001  Wade  28  IT
    2  e002 Steve  85  HR
    3  e003 Slade  55  HR
    4  e004 Bruce  45  CS

    # subsetting data frame
    > subset(emp.details, age > 50)
      empid  name age
    2  e002 Steve  85
    3  e003 Slade  55
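In addition to `cbind(…)`, which binds columns side by side, rows with matching columns can be appended with `rbind(…)`. A minimal sketch (the fifth employee record below is our own hypothetical example):

```r
emp.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    name = c("Wade", "Steve", "Slade", "Bruce"),
    age = c(28, 85, 55, 45)
)

# a hypothetical new record with the same columns
new.emp <- data.frame(empid = 'e005', name = "Diana", age = 32)

# append the new row to the existing data frame
all.emps <- rbind(emp.details, new.emp)
nrow(all.emps)
# [1] 5
```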

Now that we have a good grasp of data structures, we will look at concepts related to functions in R in the next section.

So far we have dealt with various variables, data types, and structures for storing data. Functions are just another data type or object in R, albeit a special one that allows us to operate on data and perform actions on it. Functions are useful for modularizing code and separating concerns: each specific action or operation gets its own function, which implements the logic needed for that action. We will be talking about two types of functions in this section: built-in functions and user-defined functions.

There are several functions which come with the base installation of R and its core packages. You can access these built-in functions directly using the function name and you can get more functions as you install newer packages. We depict operations using a few built-in functions in the following snippet:

    > sqrt(7)
    [1] 2.645751
    > mean(1:5)
    [1] 3
    > sum(1:5)
    [1] 15
    > sqrt(1:5)
    [1] 1.000000 1.414214 1.732051 2.000000 2.236068
    > runif(5)
    [1] 0.8880760 0.2925848 0.9240165 0.6535002 0.1891892
    > rnorm(5)
    [1]  1.90901035 -1.55611066 -0.40784306 -1.88185230  0.02035915

You can see from the previous examples that functions such as `sqrt(…)`, `mean(…)`, and `sum(…)` are built-in and pre-implemented. They can be used at any time in R without the need to define these functions explicitly or load other packages.

While built-in functions are good, you often need to incorporate your own algorithms, logic, and processes for solving a problem. That is where you need to build your own functions. Typically, there are three main components of a function, which are as follows:

The `environment(…)`, which contains the location map of the defined function and its variables

The `formals(…)`, which depicts the list of arguments which are used to call the function

The `body(…)`, which depicts the code inside the function which contains the core logic of the function

Some depictions of user-defined functions are shown in the following code snippet:

    # define the function
    square <- function(data){
        return (data^2)
    }

    # inspect function components
    > environment(square)
    <environment: R_GlobalEnv>
    > formals(square)
    $data

    > body(square)
    {
        return(data^2)
    }

    # execute the function on data
    > square(1:5)
    [1]  1  4  9 16 25
    > square(12)
    [1] 144

We can see how user-defined functions can be defined using `function(…)`; we can also examine the various components of the function, as discussed earlier, and use the function to operate on data.

When writing complete applications and scripts using R, the flow and execution of code is very important. The flow of code is based on statements, functions, variables, and data structures used in the code and it is all based on the algorithms, business logic, and other rules for solving the problem at hand. There are several constructs which can be used to control the flow of code and we will be discussing primarily the following two constructs:

Looping constructs

Conditional constructs

We will start with looking at various looping constructs for executing the same sections of code multiple times.

Looping constructs basically involve using loops which are used to execute code blocks or sections repeatedly as needed. Usually the loop keeps executing the code block in its scope until some specific condition is met or some other conditional statements are used. There are three main types of loops in R:

`for`

`while`

`repeat`

We will explore all the three constructs with examples in the following code snippet:

    # for loop
    > for (i in 1:5) {
    +   cat(paste(i," "))
    + }
    1  2  3  4  5
    > sum <- 0
    > for (i in 1:10){
    +   sum <- sum + i
    + }
    > sum
    [1] 55

    # while loop
    > n <- 1
    > while (n <= 5){
    +   cat(paste(n, " "))
    +   n <- n + 1
    + }
    1  2  3  4  5

    # repeat loop
    > i <- 1
    > repeat{
    +   cat(paste(i, " "))
    +   if (i >= 5){
    +     break  # break out of the infinite loop
    +   }
    +   i <- i + 1
    + }
    1  2  3  4  5

An important point to remember here is that, with larger amounts of data, vectorization-based constructs are more optimized than loops and we will cover some of them in the *Advanced operations* section later.
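To make the vectorization point concrete, here is a small sketch (our own example) contrasting a running-sum `for` loop with a single vectorized call:

```r
# loop version: accumulate the sum element by element
total <- 0
for (i in 1:10) {
    total <- total + i
}
total
# [1] 55

# vectorized version: one call over the whole vector
sum(1:10)
# [1] 55

# many operations have vectorized equivalents, e.g. running totals
cumsum(1:5)
# [1]  1  3  6 10 15
```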

There are several conditional constructs which help us in executing and controlling the flow of code conditionally based on user-defined rules and conditions. This is very useful when we do not want to execute all possible code blocks in a script sequentially but we want to execute specific code blocks if and only if they meet or do not meet specific conditions.

There are mainly four constructs which are used frequently in R:

`if` or `if…else`

`if…else if…else`

`ifelse(…)`

`switch(…)`

The last two are functions, as compared to the first two, which are statements using the `if`, `if…else`, and `if…else if…else` syntax. We will look at them in the following code snippet with examples:

    # using if
    > num = 10
    > if (num == 10){
    +   cat('The number was 10')
    + }
    The number was 10

    # using if-else
    > num = 5
    > if (num == 10){
    +   cat('The number was 10')
    + } else{
    +   cat('The number was not 10')
    + }
    The number was not 10

    # using if-else if-else
    > if (num == 10){
    +   cat('The number was 10')
    + } else if (num == 5){
    +   cat('The number was 5')
    + } else{
    +   cat('No match found')
    + }
    The number was 5

    # using ifelse(...) function
    > ifelse(num == 10, "Number was 10", "Number was not 10")
    [1] "Number was not 10"

    # using switch(...) function
    > for (num in c("5","10","15")){
    +   cat(
    +     switch(num,
    +       "5" = "five",
    +       "7" = "seven",
    +       "10" = "ten",
    +       "No match found"
    +     ), "\n")
    + }
    five
    ten
    No match found

From the preceding snippet, we can see that `switch(…)` has a default option, which can return a user-defined value when no match is found when evaluating the condition.

We can perform several advanced vectorized operations in R, which are useful when dealing with large amounts of data and improve code performance in terms of execution time. Some advanced constructs in the `apply` family of functions will be covered in this section, as follows:

`apply`: Evaluates a function on the boundaries or margins of an array

`lapply`: Loops over a list and evaluates a function on each element

`sapply`: A more simplified version of the `lapply(…)` function

`tapply`: Evaluates a function over subsets of a vector

`mapply`: A multivariate version of the `lapply(…)` function

Let's look at how each of these functions work in further detail.

As we mentioned earlier, the `apply(…)` function is used mainly to evaluate any defined function over the margins or boundaries of any array or matrix.

An important point to note here is that there are dedicated aggregation functions, `rowSums(…)`, `rowMeans(…)`, `colSums(…)`, and `colMeans(…)`, which use `apply` internally but are more optimized and useful compared to other functions when operating on large arrays.

The following snippet depicts some aggregation functions being applied to a matrix:

    # creating a 4x4 matrix
    > mat <- matrix(1:16, nrow=4, ncol=4)

    # view the matrix
    > mat
         [,1] [,2] [,3] [,4]
    [1,]    1    5    9   13
    [2,]    2    6   10   14
    [3,]    3    7   11   15
    [4,]    4    8   12   16

    # row sums
    > apply(mat, 1, sum)
    [1] 28 32 36 40
    > rowSums(mat)
    [1] 28 32 36 40

    # row means
    > apply(mat, 1, mean)
    [1]  7  8  9 10
    > rowMeans(mat)
    [1]  7  8  9 10

    # col sums
    > apply(mat, 2, sum)
    [1] 10 26 42 58
    > colSums(mat)
    [1] 10 26 42 58

    # col means
    > apply(mat, 2, mean)
    [1]  2.5  6.5 10.5 14.5
    > colMeans(mat)
    [1]  2.5  6.5 10.5 14.5

    # row quantiles
    > apply(mat, 1, quantile, probs=c(0.25, 0.5, 0.75))
        [,1] [,2] [,3] [,4]
    25%    4    5    6    7
    50%    7    8    9   10
    75%   10   11   12   13

You can see aggregations taking place without the need for extra looping constructs.

The `lapply(…)` function takes a list and a function as input parameters, and then evaluates that function over each element of the list. If the input is not a list, it is coerced to a list using the `as.list(…)` function before the final output is returned. All operations are vectorized, and we will see an example in the following snippet:

    # create and view a list of elements
    > l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
    > l
    $nums
     [1]  1  2  3  4  5  6  7  8  9 10

    $even
    [1]  2  4  6  8 10

    $odd
    [1] 1 3 5 7 9

    # use lapply on the list
    > lapply(l, sum)
    $nums
    [1] 55

    $even
    [1] 30

    $odd
    [1] 25
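To see the coercion mentioned above in action, a small sketch (our own example): applying `lapply(…)` to an atomic vector gives the same result as applying it to the explicitly coerced list:

```r
v <- 1:3

# lapply coerces the atomic vector to a list internally...
r1 <- lapply(v, function(x) x * 10)

# ...so the result matches lapply over as.list(v)
r2 <- lapply(as.list(v), function(x) x * 10)
identical(r1, r2)
# [1] TRUE
```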

The `sapply(…)` function is quite similar to the `lapply(…)` function, the only exception being that it always tries to simplify the final result of the computation. If every element of the result has length `1`, `sapply(…)` returns a vector; if every element has the same length greater than `1`, a matrix is returned; and if it is not able to simplify the result, we end up getting the same result as `lapply(…)`. The following example will make things clearer:

    # create and view a sample list
    > l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
    > l
    $nums
     [1]  1  2  3  4  5  6  7  8  9 10

    $even
    [1]  2  4  6  8 10

    $odd
    [1] 1 3 5 7 9

    # observe differences between lapply and sapply
    > lapply(l, mean)
    $nums
    [1] 5.5

    $even
    [1] 6

    $odd
    [1] 5

    > typeof(lapply(l, mean))
    [1] "list"
    > sapply(l, mean)
    nums even  odd
     5.5  6.0  5.0
    > typeof(sapply(l, mean))
    [1] "double"

The `tapply(…)` function is used to evaluate a function over specific subsets of input vectors, where the subsets can be defined by the user. The following example depicts this:

    > data <- 1:30
    > data
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    [26] 26 27 28 29 30
    > groups <- gl(3, 10)
    > groups
     [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
    Levels: 1 2 3
    > tapply(data, groups, sum)
      1   2   3
     55 155 255
    > tapply(data, groups, sum, simplify = FALSE)
    $'1'
    [1] 55

    $'2'
    [1] 155

    $'3'
    [1] 255

The `mapply(…)` function is used to evaluate a function in parallel over sets of arguments. It is basically a multivariate version of the `lapply(…)` function. The following example shows how easily we can build a list of vectors with `mapply(…)`, compared to calling the `rep(…)` function multiple times:

    > list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4

    > mapply(rep, 1:4, 4:1)
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4

One of the most important aspects of data analytics is depicting meaningful insights with crisp and concise visualizations. Data visualization is one of the most important aspects of exploratory data analysis, as well as an important medium for presenting the results of any analysis. There are three main popular plotting systems in R:

The base plotting system, which comes with the basic installation of R

The lattice plotting system, which produces better looking plots than the base plotting system

The `ggplot2` package, which is based on the grammar of graphics and produces beautiful, publication-quality visualizations

The following snippet depicts visualizations using all three plotting systems on the popular `iris` dataset:

    # load the data
    > data(iris)

    # base plotting system
    > boxplot(Sepal.Length~Species, data=iris,
    +         xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This gives us a set of boxplots using the base plotting system as depicted in the following plot:

    # lattice plotting system
    > library(lattice)
    > bwplot(Sepal.Length~Species, data=iris,
    +        xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This snippet helps us in creating a set of boxplots for the various species using the lattice plotting system:

    # ggplot2 plotting system
    > library(ggplot2)
    > ggplot(data=iris, aes(x=Species, y=Sepal.Length)) + geom_boxplot(aes(fill=Species)) +
    +   ylab("Sepal Length") + ggtitle("Iris Boxplot") +
    +   stat_summary(fun.y=mean, geom="point", shape=5, size=4) + theme_bw()

This code snippet gives us the following boxplots using the `ggplot2`

plotting system:

You can see from the preceding visualizations how each plotting system works and compare the plots across each system. Feel free to experiment with each system and visualize your own data!

This quick refresher in getting started with R should gear you up for the upcoming chapters and also make you more comfortable with the R eco-system, its syntax and features. It's important to remember that you can always get help with any aspect of R in a variety of ways. Besides that, packages in R are perhaps going to be the most important tool in your arsenal for analyzing data. We will briefly list some important commands for each of these two aspects which may come in handy in the future.

R has thousands of packages, functions, constructs, and structures, so it is impossible for anyone to keep track of and remember them all. Luckily, help in R is readily available, and you can always get detailed information and documentation with regard to R, its utilities, features, and packages using the following commands:

`help(<any_R_object>)` or `?<any_R_object>`: This provides help on any R object, including functions, packages, data types, and so on

`example(<function_name>)`: This provides a quick example for the mentioned function

`apropos(<any_string>)`: This lists all functions containing the `any_string` term
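The following short sketch shows these help utilities in action, using the built-in `mean` function as an arbitrary example:

```r
# open the documentation page for mean (two equivalent forms)
help(mean)
?mean

# run the example code shipped with the mean documentation
example(mean)

# list all visible functions whose names contain "mean"
apropos("mean")
```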

R, being an open-source and community-driven language, has tons of packages to help you with your analyses, which are free to download, install, and use.

In the R community, the term library is often used interchangeably with the term package.

R provides the following utilities to manage packages:

`install.packages(…)`: This installs a package from the **Comprehensive R Archive Network** (**CRAN**). CRAN helps maintain and distribute the various versions and documentation of R

`.libPaths(…)`: This gets or sets the library paths that R searches for installed packages

`installed.packages(lib.loc=)`: This lists installed packages

`update.packages(lib.loc=)`: This updates packages

`remove.packages(…)`: This removes a package

`path.package(…)`: This gives the path of a package loaded in the session

`library(…)`: This loads a package in a script to use its functions or utilities

`library(help=)`: This lists the functions in a package
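A short sketch of a typical package workflow using these utilities (the package name below is just an arbitrary example):

```r
# one-time step: install a package from CRAN
# (commented out here so the snippet runs without network access)
# install.packages("stringr")

# load a package that is already installed for the current session
library(stats)

# where does R search for installed packages?
.libPaths()

# list the names of currently installed packages
head(rownames(installed.packages()))
```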

This brings us to the end of our section on getting started with R and we will now look at some aspects of analytics, machine learning and text analytics.