Machine Learning with R Quick Start Guide

By: Iván Pastor Sanz

Overview of this book

Machine Learning with R Quick Start Guide takes you on a data-driven journey that starts with the very basics of R and machine learning. It gradually builds upon core concepts so you can handle the varied complexities of data and understand each stage of the machine learning pipeline. From data collection to implementing Natural Language Processing (NLP), this book covers it all. You will implement key machine learning algorithms to understand how they are used to build smart models. You will cover tasks such as clustering, logistic regressions, random forests, support vector machines, and more. Furthermore, you will also look at more advanced aspects such as training neural networks and topic modeling. By the end of the book, you will be able to apply the concepts of machine learning, deal with data-related problems, and solve them using the powerful yet simple language that is R.
Table of Contents (9 chapters)

Objects, special cases, and basic operators in R

By now, you will have figured out that R is an object-oriented language. All our variables, data, and functions are stored in the active memory of the computer as objects. These objects can be modified using different operators or functions. Every object in R has two intrinsic attributes: mode and length.

Mode includes the basic type of elements and has four options:

  • Numeric: Real numbers, with or without a decimal part
  • Character: Strings, that is, sequences of characters
  • Complex: Numbers with a real and an imaginary part, for example, 1+4i
  • Logical: Either TRUE or FALSE (treated as 1 and 0 in arithmetic)

Length means the number of elements in an object.

In most cases, we need not care whether the elements of a numeric object are integers, reals, or complex numbers. Calculations are carried out internally in double precision, as real or complex numbers depending on the case. To work with complex numbers, we must explicitly supply the imaginary part, for example, 1+0i.
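As a quick check of these two attributes, the following minimal sketch inspects the mode and length of a few objects. The object names are illustrative assumptions and do not come from the book:

```r
# Illustrative objects; the names are assumptions made for this sketch
num <- c(1.5, 2, 3)
txt <- "R"
cpx <- 2 + 3i
lgl <- c(TRUE, FALSE)

mode(num)    # "numeric"
mode(txt)    # "character"
mode(cpx)    # "complex"
mode(lgl)    # "logical"

length(num)  # 3
length(txt)  # 1: a single string is a character vector of length one

# Whole numbers are stored as doubles unless the L suffix is used
typeof(2)    # "double"
typeof(2L)   # "integer"
```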

If an element or value is unavailable, we assign it NA, a special value. Operations on NA elements usually return NA, unless we use a function that can handle or omit missing values. Some calculations produce a positive or negative infinite value (represented by R as Inf or -Inf, respectively). Other calculations produce results that are not numbers, which R represents as NaN (short for not a number).
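The behavior of these special values can be sketched in a few lines. The vector v is a hypothetical example; na.rm is the argument that sum() and many other base functions accept to omit missing values:

```r
# Hypothetical vector with one missing value
v <- c(4, NA, 10)

total_with_na <- sum(v)                # NA: the missing value propagates
total_no_na   <- sum(v, na.rm = TRUE)  # 14: na.rm = TRUE omits missing values
missing_flags <- is.na(v)              # FALSE TRUE FALSE

pos_inf <- 1 / 0    # Inf
neg_inf <- -1 / 0   # -Inf
not_num <- 0 / 0    # NaN

is.nan(not_num)     # TRUE
is.finite(pos_inf)  # FALSE
```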

Working with objects

You can create an object using the <- operator:

n<-10
n
## [1] 10

In the preceding code, an object called n is created. A value of 10 has been assigned to this object. The assignment can also be made using the assign() function, although this isn't very common.

Once the object has been created, it is possible to perform operations on it, like in any other programming language:

n+5
## [1] 15

These are some examples of basic operations.

Let's create our variables:

x<-4
y<-3

Now, we can carry out some basic operations:

  • Sum of variables:
x + y
## [1] 7
  • Subtraction of variables:
x - y
## [1] 1
  • Multiplication of variables:
x * y
## [1] 12
  • Division of variables:
x / y
## [1] 1.333333
  • Power of variables:
x ** y
## [1] 64

R also provides widely used constants and mathematical functions, such as the following:

  • The number pi (π):
x * pi
## [1] 12.56637
  • Exponential function:
exp(y)
## [1] 20.08554

There are also functions for working with numbers, such as the following:

  • Sign (positive or negative of a number):
sign(y)
## [1] 1
  • Finding the maximum value:
max(x,y)
## [1] 4
  • Finding the minimum value:
min(x,y)
## [1] 3
  • Factorial of a number:
factorial(y)
## [1] 6
  • Square root function:
sqrt(y)
## [1] 1.732051

It is also possible to assign the result of previous operations to another object. For example, the sum of variables x and y is assigned to an object named z:

z <- x + y
z
## [1] 7

The operators shown so far return numbers, but there are also comparison and logical operators, which return logical values:

x > y
## [1] TRUE
x + y != 8
## [1] TRUE

The main logical operators are summarized in the following table:

Operator     Description
<            Less than
<=           Less than or equal to
>            Greater than
>=           Greater than or equal to
==           Equal to
!=           Not equal to
!x           Not x
x | y        x or y
x & y        x and y
isTRUE(x)    Test if x is TRUE
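A short sketch of how the operators in the table combine. The variables p and q are illustrative and simply mirror the values of x and y used earlier:

```r
# Illustrative values, mirroring x and y from the earlier examples
p <- 4
q <- 3

both   <- (p > q) & (q > 5)  # FALSE: the second condition fails
either <- (p > q) | (q > 5)  # TRUE: the first condition holds
negate <- !(p > q)           # FALSE
check  <- isTRUE(p == 4)     # TRUE

# & and | work element by element on logical vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)  # TRUE FALSE FALSE
```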

Working with vectors

A vector is one of the basic data structures in R. All of its elements share a single type, which can be logical, double, integer, complex, character, or raw. Let's see how vectors work.

Let's create some vectors by using c():

a<-c(1,3,5,8)
a
## [1] 1 3 5 8

When elements of different types are combined in a vector, R converts them so that they all belong to the same class. A vector built only from numbers stays numeric:

y <- c(1,3)
class(y)
## [1] "numeric"
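To make the conversion visible, here is a minimal sketch that actually mixes types. The variable names are assumptions chosen for this illustration:

```r
# Mixing a number and a string: everything becomes character
mixed_chr <- c(1, "3")
class(mixed_chr)   # "character"

# Mixing numbers and logicals: TRUE and FALSE become 1 and 0
mixed_num <- c(1, TRUE, FALSE)
class(mixed_num)   # "numeric"
mixed_num          # 1 1 0

# Mixing a logical and a string: the logical becomes the string "TRUE"
mixed_lgl <- c(TRUE, "a")
mixed_lgl          # "TRUE" "a"
```

The general pattern is that R picks the most flexible type present, in the order logical, numeric, character.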

When we apply commands and functions to a vector variable, they are also applied to every element in the vector:

y <- c(1,5,1)
y + 3
## [1] 4 8 4

You can use the : operator if you wish to create a vector of consecutive numbers:

c(1:10)
## [1] 1 2 3 4 5 6 7 8 9 10

Do you need to create more complex sequences? Then use the seq() function, which lets you specify either the step size (by) or the number of equally spaced points in the interval (length.out):

seq(1, 5, by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3
## [35] 4.4 4.5 4.6 4.7 4.8 4.9 5.0
seq(1, 5, length.out=22)
## [1] 1.000000 1.190476 1.380952 1.571429 1.761905 1.952381 2.142857
## [8] 2.333333 2.523810 2.714286 2.904762 3.095238 3.285714 3.476190
## [15] 3.666667 3.857143 4.047619 4.238095 4.428571 4.619048 4.809524
## [22] 5.000000

The rep() function repeats a value a given number of times; rep(x, n) repeats x n times:

rep(3,20)
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Vector indexing

Indexing lets you access individual elements or subsets of a vector by position, by name, or by condition.

Index vectors can be logical, integer, or character.

A vector of positive integers selects elements by position (R counts from 1), and negative integers exclude the elements at those positions.

Let's see some examples of indexing:

x <- c(9,8,1,5)
  • Returns the nth element of x:
x[3]
## [1] 1
  • Returns all x values except the nth element:
x[-3]
## [1] 9 8 5

  • Returns values between a and b:
x[1:2]
## [1] 9 8
  • Returns items that are greater than a and less than b:
x[x>0 & x<4]
## [1] 1

Moreover, you can even use a logical vector as the index. In this case, the elements at positions marked TRUE are returned and those marked FALSE are dropped:

x[c(TRUE, FALSE, FALSE, TRUE)]
## [1] 9 5
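Character indexing is mentioned above but not shown. The following sketch assumes a named vector called scores, which is a hypothetical example and not part of the book:

```r
# Hypothetical named vector: each element carries a label
scores <- c(math = 9, art = 8, music = 1, sport = 5)

one_score  <- scores["art"]               # element named "art" (value 8)
two_scores <- scores[c("math", "sport")]  # elements named "math" and "sport"

# which() turns a logical condition into integer positions
high_pos <- which(scores > 4)             # positions 1, 2 and 4

# unname() drops the labels and keeps only the values
high_val <- unname(scores[scores > 4])    # 9 8 5
```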

Functions on vectors

In addition to the functions and operators that we've seen for numerical values, there are some specific functions for vectors, such as the following:

  • Sum of the elements present in a vector:
sum(x)
## [1] 23
  • Product of elements in a vector:
prod(x)
## [1] 360
  • Length of a vector:
length(x)
## [1] 4
  • Modifying a vector using the <- operator:
x
## [1] 9 8 1 5
x[1]<-22
x
## [1] 22 8 1 5

Factor

A factor is a vector used to represent categorical data. Along with the values, it stores the distinct categories, called levels, that the variable can take. Factors are created with the factor() function:

r<-c(1,4,7,9,8,1)
r<-factor(r)
r
## [1] 1 4 7 9 8 1
## Levels: 1 4 7 8 9

Factor levels

Levels are the possible values that a variable can take. A value that is repeated in the data, such as 1 in the preceding example, appears only once in the levels.

Factors can be built from numeric or character vectors, but the levels of a factor are always stored as characters.

Let's run the levels() function:

levels(r)
## [1] "1" "4" "7" "8" "9"

As you can see, 1, 4, 7, 8, and 9 are the possible levels that the factor r can have.

The exclude parameter allows you to exclude levels of a custom factor:

factor(r, exclude=4)
## [1] 1 <NA> 7 9 8 1
## Levels: 1 7 8 9

Finally, a factor can be ordered or unordered. Setting ordered=TRUE creates an ordered factor, whose levels are printed with the < symbol between them:

a<- c(1,2,7,7,1,2,2,7,1,7)
a<- factor(a, levels=c(1,2,7), ordered=TRUE)
a
## [1] 1 2 7 7 1 2 2 7 1 7
## Levels: 1 < 2 < 7
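The following sketch shows how one might check whether a factor is ordered, and a common pitfall when converting a numeric-looking factor back to numbers. The names f and o are assumptions for this illustration:

```r
# Unordered factor, same data as r in the text
f <- factor(c(1, 4, 7, 9, 8, 1))
is.ordered(f)        # FALSE

# Ordered factor: comparisons between levels are allowed
o <- factor(c(1, 2, 7), levels = c(1, 2, 7), ordered = TRUE)
is.ordered(o)        # TRUE
o[1] < o[3]          # TRUE, because level 1 comes before level 7

# Pitfall: as.numeric() returns the internal level codes, not the labels
codes  <- as.numeric(f)                # 1 2 3 5 4 1
values <- as.numeric(as.character(f))  # 1 4 7 9 8 1
```

Going through as.character() first recovers the original numbers because the levels are stored as strings.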

Strings

Any value that is written in single or double quotes will be considered a string:

c<-"This is our first string"
c
## [1] "This is our first string"
class(c)
## [1] "character"

Single quotes are allowed too, but even if you write a string in single quotes, R always prints it with double quotes.

String functions

Let's see how we can transform or convert strings using R.

The most relevant string examples are as follows:

  • To know the number of characters in a string:
nchar(c)
## [1] 24
  • To extract the substring that starts at a given position and runs to the end of the string:
substring(c,4)
## [1] "s is our first string"
  • To extract the substring that starts at position n and ends at position m:
substring(c,1,4)
## [1] "This"
  • To split the string into a list of substrings, using a delimiter as the separator:
strsplit(c, " ")
## [[1]]
## [1] "This" "is" "our" "first" "string"

  • To find which elements of a character vector contain a given pattern. grep() returns the indices of the matching elements (here 1, because c has a single element) or integer(0) when nothing matches:
grep("our", c)
## [1] 1
grep("book", c)
## integer(0)
  • To look for the first occurrence of a pattern in a string:
regexpr("our", c)
## [1] 9
## attr(,"match.length")
## [1] 3
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
  • To convert the string into lowercase:
tolower(c)
## [1] "this is our first string"
  • To convert the string into capital letters:
toupper(c)
## [1] "THIS IS OUR FIRST STRING"
  • To replace the first occurrence of a pattern with a given string:
sub("our", "my", c)
## [1] "This is my first string"
  • To replace all occurrences of a pattern with a given string:
gsub("our", "my", c)
## [1] "This is my first string"
  • To concatenate strings, separated by a given separator, using paste(..., sep="separator"):
paste(c,"My book",sep=" : ")
## [1] "This is our first string : My book"
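Because the example string contains "our" only once, sub() and gsub() return the same result above. The difference shows up with a string that repeats the pattern. The string s below is a hypothetical example:

```r
# Hypothetical string in which the pattern occurs twice
s <- "our book, our code"

first_only <- sub("our", "my", s)    # "my book, our code"
all_of_it  <- gsub("our", "my", s)   # "my book, my code"

# grepl() is the logical counterpart of grep(): TRUE or FALSE per element
has_our  <- grepl("our", s)          # TRUE
has_data <- grepl("data", s)         # FALSE
```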

Matrices

You might know that a standard matrix has a two-dimensional, rectangular layout. Matrices in R are no different than a standard matrix.

Representing matrices

To represent a matrix of n elements with r rows and c columns, the matrix() function is used. By default, the matrix is filled column by column:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Creating matrices

A matrix can be created by rows instead of by columns, which is done by using the byrow parameter, as follows:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3,byrow=TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

With the dimnames parameter, row and column names can be added to the matrix:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3,byrow=TRUE,dimnames=list(c('Obs1', 'Obs2'), c('col1', 'Col2','Col3')))
m
##      col1 Col2 Col3
## Obs1    1    2    3
## Obs2    4    5    6

There are three more alternatives for creating matrices, namely rbind(), cbind(), and array():

rbind(1:3,4:6,10:12)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]   10   11   12
cbind(1:3,4:6,10:12)
##      [,1] [,2] [,3]
## [1,]    1    4   10
## [2,]    2    5   11
## [3,]    3    6   12
m<-array(c(1,2,3,4,5,6), dim=c(2,3))
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Accessing elements in a matrix

You can access the elements of a matrix by indexing, much as you did with vectors. Here, however, you specify both the row index and the column index.

Here are some examples of accessing elements:

  • If you want to access the element in the first row and second column:
m<-array(c(1,2,3,4,5,6), dim=c(2,3))
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m[1,2]
## [1] 3
  • Similarly, accessing the element at the second column and second row:
m[2,2]
## [1] 4
  • Accessing the elements in only the second row:
m[2,]
## [1] 2 4 6
  • Accessing only the first column:
m[,1]
## [1] 1 2
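Beyond single cells, rows, and columns, the same bracket syntax can select submatrices. The following is a minimal sketch on a small illustrative matrix named mat (an assumption; it holds the same values as m above). It also uses drop=FALSE, a standard argument of the bracket operator that keeps the result as a matrix:

```r
# Illustrative 2 x 3 matrix, filled column by column
mat <- matrix(1:6, nrow = 2)

sub_mat <- mat[, 2:3]              # all rows, columns 2 and 3: a 2 x 2 matrix
row_vec <- mat[1, ]                # first row, simplified to a plain vector
row_mat <- mat[1, , drop = FALSE]  # first row, kept as a 1 x 3 matrix

dim(sub_mat)   # 2 2
dim(row_vec)   # NULL, because a plain vector has no dimensions
dim(row_mat)   # 1 3

# A logical condition selects the matching elements as a vector
big <- mat[mat > 4]   # 5 6
```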

Matrix functions

Furthermore, there are specific functions for matrices:

  • The following function extracts the diagonal as a vector:
m<-matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
diag(m)
## [1] 1 5 9
  • Returns the dimensions of a matrix:
dim(m)
## [1] 3 3
  • Returns the sum of columns of a matrix:
colSums(m)
## [1] 6 15 24
  • Returns the sum of rows of a matrix:
rowSums(m)
## [1] 12 15 18
  • The transpose of a matrix can be obtained using the following code:
t(m)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
  • Returns the determinant of a matrix:
det(m)
## [1] 0
  • The eigenvalues and eigenvectors of a matrix are obtained using the following code:
eigen(m)
## eigen() decomposition
## $values
## [1] 1.611684e+01 -1.116844e+00 -5.700691e-16
##
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.4645473 -0.8829060  0.4082483
## [2,] -0.5707955 -0.2395204 -0.8164966
## [3,] -0.6770438  0.4038651  0.4082483

Lists

A list is an ordered collection of objects, known as components, which can be of different types.

Creating lists

We can create a list using list() or by concatenating other lists:

x<- list(1:4,"book",TRUE, 1+4i)
x
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] "book"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i

Because components are ordered and numbered, they can always be referred to by their position.

Accessing components and elements in a list

To access each component in a list, a double bracket should be used:

x[[1]]
## [1] 1 2 3 4

However, it is possible to access each element of a list as well:

x[[1]][2:4]
## [1] 2 3 4
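Components can also be given names and then accessed by name. The sketch below uses a hypothetical list called book. It also contrasts single and double brackets, which is a frequent source of confusion:

```r
# Hypothetical list with named components
book <- list(pages = 1:4, title = "book", read = TRUE)

names(book)          # "pages" "title" "read"

book$title           # "book": the $ operator accesses a component by name
book[["pages"]][2]   # 2: double brackets, then ordinary vector indexing

# Single brackets return a sub-list; double brackets return the component
as_list    <- book["title"]
as_content <- book[["title"]]
class(as_list)       # "list"
class(as_content)    # "character"
```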

Data frames

Data frames are special lists that store tabular data. There is one constraint on the elements of the list: they must all have the same length. Each element of the list is a column, and the common length is the number of rows.

Just like lists, a data frame can hold columns belonging to different classes; this was not allowed in matrices.

Let's quickly create a data frame using the data.frame() function:

a <- c(1, 3, 5)
b <- c("red", "yellow", "blue")
c <- c(TRUE, FALSE, TRUE)
df <- data.frame(a, b, c)
df
##   a      b     c
## 1 1    red  TRUE
## 2 3 yellow FALSE
## 3 5   blue  TRUE

The headers of the table, a, b, and c, are the column names. Every subsequent line represents a row and starts with the name of that row.

Accessing elements in data frames

It is possible to access each cell in the table.

To do this, specify the coordinates of the desired cell. Coordinates are given as the row position followed by the column position:

df[2,1]
## [1] 3

We can even use the row and column names instead of numeric values:

df[,"a"]
## [1] 1 3 5
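Two other access patterns are worth sketching: the $ operator for columns and logical conditions for rows. The data frame below rebuilds the one from the text under the assumed name colors_df, with stringsAsFactors=FALSE set explicitly so that the string column stays character in any R version:

```r
# Same data as df in the text; the name colors_df is an assumption
colors_df <- data.frame(a = c(1, 3, 5),
                        b = c("red", "yellow", "blue"),
                        c = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)

col_b <- colors_df$b                      # "red" "yellow" "blue"

# Rows where column a is greater than 1; all columns are kept
big_a <- colors_df[colors_df$a > 1, ]
nrow(big_a)                               # 2

# A logical column can be used directly as a row filter
flagged <- colors_df[colors_df$c, "b"]    # "red" "blue"
```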

Some packages contain datasets that can be loaded to the workspace, for example, the iris dataset:

data(iris)

Functions of data frames

Some functions can be used on data frames:

  • To find out the number of columns in a data frame:
ncol(iris)
## [1] 5
  • To obtain the number of rows:
nrow(iris)
## [1] 150
  • To print the first 10 rows of data:
head(iris,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

  • Print the last 5 rows of the iris dataset:
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
  • Finally, general information of the entire dataset is obtained using str():
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Although there are a lot of operations to work with data frames, such as merging, combining, or slicing, we won't go any deeper for now. We will be using data frames in further chapters, and shall cover more operations later.

Importing or exporting data

In R, there are several functions for reading and writing data from many sources and formats. Importing data into R is quite simple.

The most common files to import into R are Excel or text files. Nevertheless, in R, it is also possible to read files in SPSS, SYSTAT, or SAS formats, among others.

In the case of Stata and SYSTAT files, I would recommend the use of the foreign package.

Let's install and load the foreign package:

install.packages("foreign")
library(foreign)

For SPSS and SAS files, the Hmisc package offers ease of use and functionality:

install.packages("Hmisc")
library(Hmisc)

Let's see some examples of importing data:

  • Import a comma-delimited text file. The first row holds the variable names, the comma is used as the separator, and the id column (assumed to exist in the file) supplies the row names:
mydata<-read.table("c:/mydata.csv", header=TRUE,sep=",", row.names="id")
  • To read an Excel file, you can either export it to a comma-delimited file and then import it, or use the xlsx package. Make sure that the first row contains the column (variable) names.
  • Let's read an Excel worksheet from a workbook, myexcel.xlsx:
library(xlsx)
mydata<-read.xlsx("c:/myexcel.xlsx", 1)
  • Now, we will read a specific sheet of an Excel file by its name:
mydata<-read.xlsx("c:/myexcel.xlsx", sheetName= "mysheet")
  • Reading from the systat format:
library(foreign)
mydata<-read.systat("c:/mydata.syd")
  • Reading from the SPSS format:
    1. First, the file should be saved from SPSS in a transport format:
get file='c:/mydata.sav'.
export outfile='c:/mydata.por'.
    2. Then, the file can be imported into R with the Hmisc package:
library(Hmisc)
mydata<-spss.get("c:/mydata.por", use.value.labels=TRUE)
  • To import a file from SAS, again, the dataset should first be converted to the transport format in SAS:
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
library(Hmisc)
mydata<-sasxport.get("c:/mydata.xpt")
  • Reading from the Stata format:
library(foreign)
mydata<-read.dta("c:/mydata.dta")

Hence, we have seen how easy it is to read data from different file formats. Let's see how simple exporting data is.

There are analogous functions to export data from R to other formats. For SAS, SPSS, and Stata, the foreign package can be used. For Excel, you will need the xlsx package.

Here are a few exporting examples:

  • We can export data to a tab delimited text file like this:
write.table(mydata, "c:/mydata.txt", sep="\t")
  • We can export to an Excel spreadsheet like this:
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
  • We can export to SPSS like this:
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
  • We can export to SAS like this:
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
  • We can export to Stata like this:
library(foreign)
write.dta(mydata, "c:/mydata.dta")
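The examples above depend on files and packages that may not be present on your machine. As a self-contained check of the import/export cycle, the following sketch uses only base R functions (write.csv(), read.csv(), and tempfile()). The small data frame and the variable names are assumptions made for this illustration:

```r
# Hypothetical data frame to export
sample_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# tempfile() gives a throwaway path, so nothing is written to c:/
path <- tempfile(fileext = ".csv")

# Export without row names so the file holds only the two columns
write.csv(sample_data, path, row.names = FALSE)

# Import the file back; the header row supplies the column names
restored <- read.csv(path)

# Compare the round-tripped data with the original
same_data <- isTRUE(all.equal(restored, sample_data))

# Remove the temporary file
unlink(path)
```

Writing to a temporary file keeps the sketch portable across operating systems, whereas hard-coded paths such as c:/mydata.csv only work on Windows.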

Working with functions

Functions are the core of R, and they are useful for structuring and modularizing code. We have already seen some functions in the preceding sections. These are built-in functions, available either in base R or through packages that we install.

On the other hand, we can define and create our own functions based on different operations and computations we want to perform on the data. We will create functions in R using the function() directive, and these functions will be stored as objects in R.

Here is what the structure of a function in R looks like:

myfunction <- function(arg1, arg2, … )
{
statements
return(object)
}

Objects created inside a function are local to that function, and the returned object can be of any data type. We can even pass functions as arguments to other functions.

Functions in R support nesting, which means that we can define a function within a function and the code will work just fine.

If return() is not called explicitly, the value of a function is the value of the last expression evaluated.

Once a function is defined, we can call it by its name, passing the required arguments.

Let's create a function named squaredNum, which calculates the square value of a number:

squaredNum<-function(number)
{
a<-number^2
return(a)
}

Now, we can calculate the square of any number using the function that we just created:

squaredNum(425)
## [1] 180625
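The other properties described above (implicit return values, functions passed as arguments, and nested functions) can be sketched as follows. All three function names are hypothetical and chosen only for this illustration:

```r
# Implicit return: the last evaluated expression is the result
cube <- function(number) {
  number^3
}

# A function passed as an argument: f is applied twice to value
applyTwice <- function(f, value) {
  f(f(value))
}

# A nested function: square() exists only inside sumOfSquares()
sumOfSquares <- function(values) {
  square <- function(v) v^2
  sum(square(values))
}

cube(3)                  # 27
applyTwice(cube, 2)      # 512, that is, (2^3)^3
sumOfSquares(c(1, 2, 3)) # 14, that is, 1 + 4 + 9
```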

As we move on in this book, we will see how important such user-defined functions are.