Mastering Scientific Computing with R

Loading data into R


There are several ways to load data into R. The most common is to use the read.table() function or one of its derivatives: read.csv() for comma-separated (.csv) files or read.delim() for tab-delimited text (.txt) files. You can also read Excel data in the .xls or .xlsx format directly using the gdata or XLConnect package. Other file formats, such as Minitab Portable Worksheet (.mtp) and SPSS (.spss) files, can be opened using the foreign package.

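To install a package from within R, you can use the install.packages() function. Called with just a quoted package name, for example install.packages("gdata"), it downloads the package from a CRAN mirror. To install from a source archive that you have already downloaded, pass the quoted file name and set repos to NULL, as follows:

> install.packages("pkgname.tar.gz", repos = NULL, type = "source")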

Next, load the package (often loosely called a library) using the library() or require() function. The require() function is designed for use inside other functions: when the package is missing, it returns FALSE with a warning, whereas library() stops with an error. You only need to load a package once per R session.
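The difference between the two functions can be sketched as follows. This is only an illustration: load_if_available() is a hypothetical helper, not part of base R, and "notARealPackage123" is a made-up name that is assumed not to be installed.

```r
# Hypothetical helper: try to load a package and report the outcome
# instead of stopping with an error, as library() would.
load_if_available <- function(pkg) {
  # require() returns TRUE or FALSE; character.only = TRUE lets us pass
  # the package name as a string held in a variable.
  ok <- suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
  if (!ok) {
    message("Package '", pkg, "' is not installed; skipping.")
  }
  invisible(ok)
}

has_stats <- load_if_available("stats")               # ships with R
has_fake  <- load_if_available("notARealPackage123")  # assumed missing
```

Because the helper returns a logical value, the calling code can decide whether to fall back to another approach or stop with its own message.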

The first thing to do before loading a file is to make sure that R is in the right working directory. You can see where R reads and saves files by default using the getwd() function, and change that location using the setwd() function. Use the full path when setting the working directory, because a relative path that does not exist from the current location produces errors such as Error in setwd("new_directory") : cannot change working directory.

For example, run the following commands on a Mac operating system:

> getwd()
[1] "/Users/johnsonR/"
> setwd("/Users/johnsonR/myDirectory")

To work with data in the C: drive in the myDirectory folder on a Windows version of R, you will need to set the working directory as follows:

> setwd("C:/myDirectory")
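A platform-independent way to handle the working directory might look like the sketch below. The folder name myDirectory is reused from the examples above, but placing it under tempdir() is an assumption made so that the code can run on any machine.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# Build a full path with file.path(), which inserts the separator for us.
# The location under tempdir() is purely illustrative.
work_dir <- file.path(tempdir(), "myDirectory")
dir.create(work_dir, showWarnings = FALSE)

# Only switch when the folder really exists, which avoids the
# "cannot change working directory" error.
if (dir.exists(work_dir)) {
  setwd(work_dir)
}
current_wd <- getwd()

# Return to the original location.
setwd(old_wd)
```

Saving the result of getwd() before calling setwd() makes it easy to put things back once the files have been read or written.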

Then, you can use the read.table() function to load your data as follows:

#To specify that the file is a tab delimited text file we use the sep argument with "\t"
> myData.df <- read.table("myData.txt", header=TRUE, sep="\t")
> myData.df 
   A  B C
1 12  6 8
2  4  9 2
3  5 13 3

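Alternatively, you could use the read.delim() function, which assumes a tab delimiter by default, or the read.csv() function for comma-separated files. Note that header=FALSE in the read.csv() call below makes R treat the column names as the first row of data: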

> read.delim("myData.txt", header=TRUE)
   A  B C
1 12  6 8
2  4  9 2
3  5 13 3
> myData2.df <-read.csv("myData.csv", header=FALSE)
> myData2.df
  V1 V2 V3
1  A  B  C
2 12  6  8
3  4  9  2
4  5 13  3
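To try these functions without an existing file, the following sketch first writes the small table shown above to a temporary tab-delimited file and then reads it back in two ways. The file created with tempfile() and the object names are illustrative assumptions.

```r
# Write the example table to a temporary tab-delimited file.
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("A\tB\tC", "12\t6\t8", "4\t9\t2", "5\t13\t3"), tmp_txt)

# read.table() needs the header and separator spelled out.
from_table <- read.table(tmp_txt, header = TRUE, sep = "\t")

# read.delim() already defaults to header = TRUE and sep = "\t".
from_delim <- read.delim(tmp_txt)

# Both calls should give a data frame with 3 rows and columns A, B and C.
dim(from_table)
names(from_delim)
```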

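In versions of R before 4.0.0, these functions by default return data frames with all string-containing columns converted to factors unless you set stringsAsFactors=FALSE in read.table(), read.delim(), or read.csv(). Since R 4.0.0, the default is stringsAsFactors=FALSE, so strings stay as character vectors. The following example shows the older default behavior first, followed by the result of setting stringsAsFactors=FALSE explicitly: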

> str(myData2.df)
'data.frame':  4 obs. of  3 variables:
 $ V1: Factor w/ 4 levels "12","4","5","A": 4 1 2 3
 $ V2: Factor w/ 4 levels "13","6","9","B": 4 2 3 1
 $ V3: Factor w/ 4 levels "2","3","8","C": 4 3 1 2
> myData2.df <-read.csv("myData.csv", header=FALSE, stringsAsFactors=FALSE)
> str(myData2.df)
'data.frame':  4 obs. of  3 variables:
 $ V1: chr  "A" "12" "4" "5"
 $ V2: chr  "B" "6" "9" "13"
 $ V3: chr  "C" "8" "2" "3"
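Because the default depends on the R version, it can be safer to state the choice explicitly. The sketch below reads the same small, made-up CSV text both ways; the text argument is used here only to avoid needing a file on disk.

```r
# Illustrative data supplied as a string instead of a file.
csv_text <- "name,score\nalpha,12\nbeta,4\ngamma,5"

# Explicitly request factors (the default before R 4.0.0).
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Explicitly keep strings as character vectors (the default since R 4.0.0).
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$name)  # "factor"
class(as_char$name)    # "character"
```

Numeric columns such as score are unaffected by this argument; only columns containing strings change type.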

To read Excel sheets using the gdata package, load the package into R and then use the read.xls() function as follows:

> library("gdata")
> myData.df <- read.xls("myData.xlsx", sheet=1) #also reads .xls files and returns a data frame

Alternatively, you could load a complete workbook and read the worksheets separately using the XLConnect package as follows:

> library("XLConnect")
> myData.workbook <- loadWorkbook("myData.xlsx")
> myData3.df <- readWorksheet(myData.workbook, sheet="Sheet1")

To read the .mtp and .spss files, first load the foreign package, and then use the read.mtp() and read.spss() functions. By default, these functions return a list of components, so you will have to convert the data into a data frame afterwards. Alternatively, for .spss files, the read.spss() function has a to.data.frame argument that allows it to return a data frame instead, as follows:

> library("foreign")
> myData4.df <- read.spss("myfile.spss", to.data.frame=TRUE)

Saving data frames

To save an object, preferably a matrix or data frame, you can use the write.table() function to write a .txt file or a file with another delimiter. You can choose to include row and column names by setting the row.names and col.names arguments to TRUE. The output file is saved to your current working directory unless you give a full path. Note that write.table() by default wraps character and factor values in quotation marks in the output file, so we suggest that you set the quote argument to FALSE to avoid seeing quotation marks should you open the file with a text editor. Let's take a look at a few examples:

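> write.table(myData2.df, file="savedata_file.txt", quote = FALSE, sep = "\t", row.names=TRUE, col.names=TRUE, append=FALSE)

By default, the header line has no entry for the column of row names, so it contains one field fewer than the data rows and the column names are shifted one position to the left. Your output would look like this: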

V1   V2   V3
1     A    B   C
2    12    6   8
3     4    9   2
4     5   13   3

To fix this so that the file displays correctly in a spreadsheet viewer such as Excel, set col.names=NA together with row.names=TRUE, which writes an empty header field above the row names, as follows:

> write.table(myData2.df, file="savedata_file.txt", quote = FALSE, sep = "\t",col.names = NA, row.names = TRUE, append=FALSE)
    V1   V2   V3
1    A    B    C
2   12    6    8
3    4    9    2
4    5   13    3

Alternatively, you could use the write.csv() function, which has col.names=NA and row.names=TRUE set as defaults:

> write.csv(myData2.df, file = "savedata_file.csv") #same layout as above, but comma-separated
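The effect of these defaults can be checked with a small round trip. The sketch below uses a made-up data frame named scores and a temporary file. It writes the data with write.csv(), inspects the header line, and reads the file back with the first column used as row names.

```r
# Illustrative data frame; the name and values are assumptions.
scores <- data.frame(A = c(12L, 4L, 5L),
                     B = c(6L, 9L, 13L),
                     C = c(8L, 2L, 3L))

out_file <- tempfile(fileext = ".csv")
write.csv(scores, file = out_file)

# The header starts with an empty quoted field for the row-name column.
header_line <- readLines(out_file)[1]

# Tell read.csv() that the first column holds the row names.
round_trip <- read.csv(out_file, row.names = 1)
```

If the row names carry no information, passing row.names=FALSE to write.csv() omits that first column altogether.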

If you would like to save a series of data frames in an Excel workbook, we recommend that you use the WriteXLS package, which greatly simplifies the task. Here is an example of the code you could use to save two data frames (df1 and df2) as two separate worksheets with the sheet names set as "df1_results" and "df2_results" in a file called combined_dfs_workbook.xls:

> library("WriteXLS")
> dfs.tosave <- c("df1", "df2")
> sheets.tosave <- c("df1_results", "df2_results")
> WriteXLS(dfs.tosave, ExcelFileName = "combined_dfs_workbook.xls", SheetNames = sheets.tosave)

You can also save and reload R objects for future sessions using the dump() and source() functions. For example, say you created several list objects containing important data for routine analysis. A list saved to a spreadsheet or .txt file can be difficult to reload afterwards, since most read functions return a data frame. A simpler way to proceed is to save (or dump) the object to a file that R can reopen (source) in another session.

The following code shows how you can save that object:

> dump("myData.df", "myData.R")
> #Or if you would like to save all objects in your session:
> dump(list=objects(), "all_objects.R")

The myData.R file will contain all the commands necessary to recreate that object in a future session. At a later date, you can retrieve the data as follows:

> source("myData.R")
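For a list object, the dump-and-source cycle might be sketched as follows. The settings list, the temporary file, and the separate environment used for restoring are all illustrative assumptions. Sourcing into a new environment simply avoids overwriting the original object, so the two can be compared.

```r
# A made-up list of the kind that is awkward to store in a spreadsheet.
settings <- list(alpha = 0.05, groups = c("control", "treated"))

# dump() writes R code that recreates the object under the same name.
dump_file <- tempfile(fileext = ".R")
dump("settings", file = dump_file)

# The file is plain text and can be inspected or edited.
dump_code <- readLines(dump_file)

# source() runs that code; here it is evaluated in a separate environment.
restored <- new.env()
source(dump_file, local = restored)

names(restored$settings)
```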

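You can also use the save() and load() functions to save and retrieve your objects at a later time. Unlike dump(), save() writes a binary file, which is conventionally given the .RData extension, as follows:

> save(myData.df, file="myData.RData")
> load("myData.RData")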

A good alternative to the save() and load() functions is the pair saveRDS() and readRDS(). The saveRDS() function doesn't save the object together with its name; it saves only a representation of the object. Therefore, when you retrieve the data with the readRDS() function, you need to assign the result to a name. Unlike the save() function, saveRDS() can save only one object at a time. For example, to save the myData.df object and retrieve it later, you can execute the following lines of code:

# To save the object
> saveRDS(myData.df, "myData.rds")
# To load the object and assign it to a new name
> myData2 <- readRDS("myData.rds")
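The contrast between the two pairs of functions can be seen side by side in the sketch below. The counts vector, the temporary files, and the environment used for load() are assumptions made for illustration.

```r
# A small made-up object to store.
counts <- c(a = 1L, b = 2L)

rdata_file <- tempfile(fileext = ".RData")
rds_file   <- tempfile(fileext = ".rds")

# save() stores the object together with its name.
save(counts, file = rdata_file)

# saveRDS() stores only the value of a single object.
saveRDS(counts, rds_file)

# load() recreates "counts" under its original name (here inside a
# separate environment) and returns the names it restored.
restored_env <- new.env()
loaded_names <- load(rdata_file, envir = restored_env)

# readRDS() returns the value, so any name can be chosen for it.
renamed <- readRDS(rds_file)
```

Since load() silently overwrites objects that share a name with those in the file, readRDS() can be the safer choice when a session already contains many objects.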

You can also redirect the R output to a file using the sink(file="filename") function as follows:

> sink("data_session1.txt")
> x <- c(1,2,3)
> y <- c(4,5,6)
> #This is a comment
> x+y #Note the sum of x+y is redirected to data_session1.txt

To stop redirecting the output to the file and print new output to the screen, just run the sink() function again without any arguments as follows:

> sink()
> 3+4
[1] 7

When you open the data_session1.txt file, you will notice that only the result of x+y is saved to the file, and not the commands or comments you entered. The following is the content of the data_session1.txt file:

[1] 5 7 9
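The same behavior can be reproduced in a script, as in the sketch below. It writes to a temporary file rather than data_session1.txt, and it calls print() explicitly, because results are printed automatically only when an expression is entered at the top level of the console.

```r
# Temporary file standing in for data_session1.txt.
log_file <- tempfile(fileext = ".txt")

sink(log_file)   # start redirecting printed output to the file
x <- c(1, 2, 3)
y <- c(4, 5, 6)
# This comment is not written to the file.
print(x + y)     # the printed result goes to log_file
sink()           # stop redirecting; output returns to the screen

# Read the file back to see exactly what was captured.
captured <- readLines(log_file)
captured
```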