Data Manipulation with R - Second Edition

By: Jaynal Abedin, Kishor Kumar Das

Overview of this book

This book starts with the installation of R and how to go about using R and its libraries. We then discuss the mode of R objects and their classes, and highlight the different R data types along with their basic operations.

The primary focus, group-wise data manipulation with the split-apply-combine strategy, is explained with specific examples. The book also covers specific libraries such as lubridate, reshape2, plyr, dplyr, stringr, and sqldf. You will not only learn about group-wise data manipulation, but also how to efficiently handle date, string, and factor variables, along with different layouts of datasets using the reshape2 package.

By the end of this book, you will have learned about text manipulation using stringr, how to extract data from Twitter using the twitteR library, how to clean raw data, and how to structure your raw data for data mining.

The R object structure and mode conversion


When we work with any statistical software, such as R, we rarely use a single value for an object. We need to know how to handle a collection of data values (for example, the ages of 100 randomly selected diabetic patients), along with what type of object is needed to store them. In R, the most convenient way to store more than one data value is a vector: a collection of data values stored in a single object. In fact, whenever we create an R object from data values, R stores those values as a vector. It could be a single-element vector or a multiple-element vector. The num.obj vector we created in the previous section is a vector that comprises numeric elements.

One of the simplest ways to create a vector in R is to use the c() function. Here is an example:

# creating a vector of numeric elements with the c() function
> num.vec <- c(1,3,5,7)
> num.vec
[1] 1 3 5 7
> mode(num.vec)
[1] "numeric"
> class(num.vec)
[1] "numeric"
> is.vector(num.vec)
[1] TRUE

If we create a vector with mixed elements (character and numeric), the resulting vector will be a character vector. Here is an example:

# Vector with mixed elements 
> num.char.vec <- c(1,3,"five",7)
> num.char.vec
[1] "1"    "3"    "five" "7"   
> mode(num.char.vec)
[1] "character"
> class(num.char.vec)
[1] "character"
> is.vector(num.char.vec)
[1] TRUE
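The same rule applies beyond the numeric and character case: when c() receives elements of different types, it coerces all of them to the most general type present. The following is a minimal sketch; the object names are made up for illustration and do not appear elsewhere in this section.

```r
# c() coerces mixed inputs to the most general type present:
# logical -> integer -> double -> character
lgl.int.vec <- c(TRUE, 2L)     # logical + integer
int.dbl.vec <- c(1L, 2.5)      # integer + double
dbl.chr.vec <- c(1.5, "a")     # double + character

typeof(lgl.int.vec)   # "integer"
typeof(int.dbl.vec)   # "double"
typeof(dbl.chr.vec)   # "character"

lgl.int.vec           # TRUE has been coerced to 1
```

Because the coercion is silent, it is worth checking mode() or typeof() after combining values that come from different sources.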

We can create a larger vector by combining multiple vectors; the resulting vector's mode will be character if any of the combined vectors contains a character element. A vector's elements can be named or unnamed. In the previous examples, the vectors had no names.

The following example shows how to combine vectors and how to create a vector with a name for each element:

# combining multiple vectors
> comb.vec <- c(num.vec,num.char.vec)
> mode(comb.vec)
[1] "character"

# creating named vector
> named.num.vec <- c(x1=1,x2=3,x3=5)
> named.num.vec
x1 x2 x3 
 1  3  5 

The names of the elements in a vector can also be assigned separately using the names() function. In R, any single constant is also stored as a vector with a single element.

Here is an example:

# vector of single element
> unit.vec <- 9
> is.vector(unit.vec)
[1] TRUE
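Returning to named vectors, the names() function mentioned earlier can attach names after a vector has been created. The following is a small sketch, not taken from the book; it redefines num.vec with three elements for brevity.

```r
# start with an unnamed numeric vector
num.vec <- c(1, 3, 5)
names(num.vec)                 # NULL: no names yet

# attach names after creation
names(num.vec) <- c("x1", "x2", "x3")
num.vec

# elements can now be looked up by name as well as by position
num.vec["x2"]
unname(num.vec["x2"])          # 3
```

Looking elements up by name tends to be more robust than looking them up by position when the order of the elements may change.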

R has six basic storage types of vectors, and each type is known as an atomic vector.

The following table shows the six basic vector types, their mode, and the storage mode:

Type         Mode         Storage mode
logical      logical      logical
integer      numeric      integer
double       numeric      double
complex      complex      complex
character    character    character
raw          raw          raw
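The entries in this table can be checked directly in R. The sketch below builds one single-element vector of each type (the values are arbitrary) and asks R for its mode and storage mode; the names examples and type.table are introduced here only for illustration.

```r
# one single-element example per atomic vector type (values are arbitrary)
examples <- list(
  logical   = TRUE,
  integer   = 1L,
  double    = 1.5,
  complex   = 1 + 2i,
  character = "a",
  raw       = as.raw(1)
)

# collect what R reports for each example
type.table <- data.frame(
  mode             = sapply(examples, mode),
  storage.mode     = sapply(examples, storage.mode),
  stringsAsFactors = FALSE
)
type.table
```

Note that integer and double share the numeric mode and differ only in their storage mode.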

Other than vectors, there are different storage types available in R to handle data with multiple elements; these are matrix, data frame, and list. We will discuss each of these types in subsequent sections.

To convert an object's mode, R provides a family of user-friendly functions of the form as.x, where x can be numeric, logical, character, list, data.frame, and so on. For example, if we need to perform a matrix operation that requires numeric mode and the data is stored in some other mode, the operation cannot be performed. In this case, we first need to convert the data into numeric mode.
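The matrix case can be sketched as follows. This is an illustration under assumed data: char.mat stands for a small matrix whose entries happen to be stored as character strings. Setting the storage mode is used here instead of as.numeric() because as.numeric() would drop the matrix dimensions.

```r
# a 2 x 2 matrix whose entries are stored as character strings
char.mat <- matrix(c("1", "2", "3", "4"), nrow = 2)

# matrix multiplication fails on character data; capture the error message
attempt <- tryCatch(char.mat %*% char.mat,
                    error = function(e) conditionMessage(e))
attempt

# convert in place so that the dim attribute is preserved
num.mat <- char.mat
storage.mode(num.mat) <- "double"
dim(num.mat)            # 2 2

# the operation now works
num.mat %*% num.mat
```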

In the following example, we will see how we can convert an object's mode:

# creating a vector of numbers and then converting it to logical
# and character
> numbers.vec <- c(-3,-2,-1,0,1,2,3)
> numbers.vec
[1] -3 -2 -1  0  1  2  3
> num2char <- as.character(numbers.vec)
> num2char
[1] "-3" "-2" "-1" "0"  "1"  "2"  "3"
> num2logical <- as.logical(numbers.vec)
> num2logical
[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# creating a character vector and then converting it to numeric and logical
> char.vec <- c("1","3","five","7")
> char.vec
[1] "1"    "3"    "five" "7"   
> char2num <- as.numeric(char.vec)
Warning message:
NAs introduced by coercion 
> char2num
[1]  1  3 NA  7
> char2logical <- as.logical(char.vec)
> char2logical
[1] NA NA NA NA

# logical to character conversion
> logical.vec <- c(TRUE, FALSE, FALSE,  TRUE,  TRUE)
> logical.vec
[1]  TRUE FALSE FALSE  TRUE  TRUE
> logical2char <- as.character(logical.vec)
> logical2char
[1] "TRUE"  "FALSE" "FALSE" "TRUE"  "TRUE"

Note that when we convert numeric mode to logical mode, only 0 (zero) becomes FALSE; all other values become TRUE. When we convert a character object to numeric, elements that look like numbers are converted, any element that cannot be parsed as a number becomes NA, and a warning is issued. Importantly, as.logical() converts a character element only if it is one of a few recognized strings such as "TRUE", "true", "T", "FALSE", "false", or "F"; any other string, including "1", becomes NA, and no warning is issued. That is why every element of char2logical is NA. Logical objects, however, convert to character objects without any problem.
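One refinement to the character-to-logical case is worth seeing in practice: as.logical() does recognize a small set of strings, and returns NA for everything else. The vector candidates below is made up for this illustration.

```r
# six strings that as.logical() recognizes, followed by two that it does not
candidates <- c("TRUE", "true", "T", "FALSE", "false", "F", "yes", "1")
as.logical(candidates)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE    NA    NA
```

Unlike as.numeric(), as.logical() does not warn when it produces NA, so failed conversions are easy to overlook.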

Finally, we can say that any of these vectors can be converted to character without a warning. However, if we want to convert character objects to any other type, we have to be careful.
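One way to be careful, sketched here as a suggestion rather than as the book's own method, is to convert first and then look for elements that became NA only because of the conversion. The names converted and failed are hypothetical, and an NA has been added to char.vec to show that genuinely missing values are not flagged.

```r
char.vec <- c("1", "3", "five", "7", NA)

# convert quietly, then flag entries that became NA only because of coercion
converted <- suppressWarnings(as.numeric(char.vec))
failed    <- is.na(converted) & !is.na(char.vec)

char.vec[failed]   # "five"
sum(failed)        # 1
```

Inspecting char.vec[failed] before continuing shows exactly which raw values need cleaning.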

Vector

R is a domain-specific programming language designed for statistical analysis of data. In statistics, when we analyze data, the first thing that comes to mind is a variable with hundreds of observations in it, which naturally suggests a vector. This is probably the main reason why the most elementary data type in R is a vector. A vector is a contiguous sequence of cells that contain data, where each cell can be accessed by an index:

> age <- c(10,20,30,40)

This is an example of a vector. The ages of four individuals are stored in the age vector. Pay attention to how the vector was formed and stored under the age variable. Here, c() is the function that creates the vector, but on its own it does not store the data anywhere; <- is the assignment operator, which stores the vector under a variable name.

Now, in the console, let's type the following line and press Enter:

> age
[1] 10 20 30 40

We successfully stored all the ages under the age variable, but what is [1]? It is the index of the first value printed on that line; here, it means that the index of the value 10 is 1.

If you want to see the third value of the vector, type the following command:

> age[3]
[1] 30

Why did R only show the index of the first value and not the other values? This is only to keep the output clean and informative. Every time R writes a new line, it first gives the index number of the next value. Pretty soon, you will be familiar with this convention. We can store a single value under a variable, but it will be a vector with one element:

> height <- 175

To show you that height is not a scalar but a vector with one element, we will store one additional value in it:

> height[2] <- 180

Pay attention to how we added another value to an existing vector. Here, we put 180 in the second cell of the vector. Recall how we accessed the third cell of the age variable using age[3]. We can assign a value to a cell using the same indexing syntax. Let's put another value inside the height variable:

> height[3] <- 165

Now, we can see all the values stored inside the height variable:

> height
[1] 175 180 165   
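The steps above can be put together in one short script. The sketch also shows what happens when an index is skipped during assignment, a case the text does not cover; the value 160 is arbitrary.

```r
height <- 175
length(height)        # 1: a single value is a vector of length one

height[2] <- 180      # extend the vector by assigning to the next index
height[5] <- 160      # skipping indices 3 and 4 fills them with NA
height
length(height)        # 5
```

Growing a vector one cell at a time is convenient for small examples like this one; for large data it is usually better to create the whole vector in a single call.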

Although the basic data structure in R is the vector, there are different types of vectors. We use a numeric vector to store numeric data such as age, height, weight, and so on. Character vectors are used to store string data such as name, address, and so on. Defining a character vector in R is simple:

> name <- c("Rob", "Bob", "Jude", "Monica")

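When we want to store a character string in R, we need to enclose it in quotes, as in the previous example (double quotes are used here; single quotes work as well). This tells R that the input is a string. We can also put numeric values inside quotes, in which case they are stored as character strings. If we type text without quotes, however, R treats it as the name of an object and returns an error message if no such object exists.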
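These quoting rules can be tried out with a few lines. The sketch below is illustrative only: quoted.number and result are made-up names, and the last step assumes that no object called Jude exists in the workspace.

```r
# single and double quotes are interchangeable for strings
name <- c('Rob', "Bob")
mode(name)                    # "character"

# a number inside quotes is stored as text, not as a number
quoted.number <- "42"
is.numeric(quoted.number)     # FALSE

# unquoted text is looked up as an object name; Jude is assumed not to exist,
# so R raises an error, which is captured here as a message
result <- tryCatch(c(Jude, "Monica"),
                   error = function(e) conditionMessage(e))
result
```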

Another special type of vector is the logical vector. A logical vector has two possible element values: TRUE and FALSE. There are two ways we could define a logical vector; first, we will show you the more formal way and, second, the quick way. Logical vectors are used in logical operations in R, for example to select specific rows from a dataset.

We can define a logical vector in the following way:

> logical <- c(TRUE, FALSE, TRUE, FALSE)

This logical vector can be used as an element selector for the age vector in the following way:

> age[logical]
[1] 10 30

Look closely at what we just did. We have already seen how to extract an element from the age vector using an index. A logical vector can be thought of as a vector of selection flags, one per element. The first element of the logical vector is TRUE, so the first element of the age vector is selected. The second element is FALSE, so the second element of the age vector is not selected. In this way, the logical vector selects only those elements of the age vector for which it is TRUE. Here, two elements are selected, and a vector of two elements is returned. A question that may come to mind is: what can we do with this feature? The answer will become clearer in the Data frame section.
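In practice, a logical selector is often produced by a comparison rather than typed by hand. The sketch below goes slightly beyond the text: the threshold 15 and the name older are arbitrary choices made for illustration.

```r
age <- c(10, 20, 30, 40)

# a comparison returns a logical vector of the same length
older <- age > 15
older                 # FALSE TRUE TRUE TRUE

# use it directly as a selector
age[older]            # 20 30 40
age[age > 15]         # same result without the intermediate object

# which() converts the logical vector into the matching positions
which(older)          # 2 3 4
```

The same pattern, a condition evaluated on one vector and used to select elements, carries over to selecting rows of a dataset.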