
Loading text files of a reasonable size


The title of this chapter might as well be Hello, Big Data!, as we now concentrate on loading relatively large amounts of data in an R session. But what is Big Data, and what amount of data is problematic to handle in R? What is a reasonable size?

R was designed to process data that fits in the physical memory of a single computer, so handling datasets smaller than the actual accessible RAM should be fine. But please note that the memory required to process the data might grow during some computations, such as principal component analysis, which should also be taken into account. I will refer to this amount of data as reasonably sized datasets.
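Whether a dataset counts as reasonably sized can be gauged by comparing its in-memory footprint with the available RAM. As a quick sketch (using the built-in mtcars dataset purely for illustration), the object.size function reports how much memory an object already loaded in R occupies:

> # report the in-memory footprint of an already loaded data frame
> print(object.size(mtcars), units = 'Kb')

Keep in mind that intermediate copies created during computations can multiply this footprint, so leave generous headroom.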

Loading data from text files is pretty simple with R, and loading any reasonably sized dataset can be achieved by calling the good old read.table function. The only issue here might be the performance: how long does it take to read, for example, a quarter of a million rows of data? Let's see:

> library('hflights')
> write.csv(hflights, 'hflights.csv', row.names = FALSE)

Note

As a reminder, please note that all R commands and the returned output are formatted as earlier in this book. The commands start with > on the first line, and the remainder of a multi-line expression starts with +, just as in the R console. To copy and paste these commands on your machine, please download the code examples from the Packt homepage. For more details, please see the What you need for this book section in the Preface.

Yes, we have just written an 18.5 MB text file to your disk from the hflights package, which includes some data on all flights departing from Houston in 2011:

> str(hflights)
'data.frame':  227496 obs. of  21 variables:
 $ Year             : int  2011 2011 2011 2011 2011 2011 2011 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
 $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 ...
 $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 ...
 $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
 $ FlightNum        : int  428 428 428 428 428 428 428 428 428 ...
 $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
 $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
 $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
 $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
 $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
 $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
 $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
 $ Distance         : int  224 224 224 224 224 224 224 224 224 ...
 $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
 $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr  "" "" "" "" ...
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

Note

The hflights package provides an easy way to load a subset of the huge Airline Dataset of the Research and Innovative Technology Administration at the Bureau of Transportation Statistics. The original database includes the scheduled and actual departure/arrival times of all US flights since 1987, along with some other interesting information, and is often used to demonstrate machine learning and Big Data technologies. For more details on the dataset, please see the column description and other metadata at http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0.

We will use this 21-column data to benchmark data import times. For example, let's see how long it takes to import the CSV file with read.csv:

> system.time(read.csv('hflights.csv'))
   user  system elapsed 
  1.730   0.007   1.738

It took a bit more than one and a half seconds to load the data from an SSD here. That's quite okay, but we can achieve far better results by identifying and then specifying the classes of the columns up front, instead of relying on the default type.convert calls (see the docs of read.table for more details, or search on Stack Overflow, where the performance of read.csv seems to be a rather frequent and popular question):

> colClasses <- sapply(hflights, class)
> system.time(read.csv('hflights.csv', colClasses = colClasses))
   user  system elapsed 
  1.093   0.000   1.092

It's much better! But should we trust this one observation? On our way to mastering data analysis in R, we should implement more reliable tests by simply replicating the task n times and summarizing the results of the simulation. This approach provides multiple observations of the performance, which can be used to identify statistically significant differences in the results. The microbenchmark package provides a nice framework for such tasks:

> library(microbenchmark)
> f <- function() read.csv('hflights.csv')
> g <- function() read.csv('hflights.csv', colClasses = colClasses,
+                        nrows = 227496, comment.char = '')
> res <- microbenchmark(f(), g())
> res
Unit: milliseconds
 expr       min        lq   median       uq      max neval
  f() 1552.3383 1617.8611 1646.524 1708.393 2185.565   100
  g()  928.2675  957.3842  989.467 1044.571 1284.351   100

So we defined two functions: f stands for the default settings of read.csv, while in the g function we passed the aforementioned column classes along with two other parameters for increased performance. The comment.char argument tells R not to look for comments in the imported data file, while the nrows parameter defines the exact number of rows to read from the file, which saves some time and space on memory allocation. Setting stringsAsFactors to FALSE might also speed up importing a bit.
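For example, a variant that only disables the character-to-factor conversion could be sketched as follows (the function name h is purely illustrative, and the actual gain, if any, depends on your R version and the number of character columns); adding h() to the microbenchmark call above would quantify its effect:

> # skip converting character columns to factors while importing
> h <- function() read.csv('hflights.csv', stringsAsFactors = FALSE)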

Note

Identifying the number of lines in a text file can be done with third-party tools, such as wc on Unix; a slightly slower alternative is the countLines function from the R.utils package.
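For example, either of the following should return the number of lines in our file (both commands are shown only as a quick sketch; the first requires a Unix-like system):

> # count lines with the external wc tool
> system('wc -l hflights.csv')
> # or stay within R via the R.utils package
> R.utils::countLines('hflights.csv')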

But back to the results. Let's also visualize the median and related descriptive statistics of the test cases, which were run 100 times each by default:

> boxplot(res, xlab  = '',
+   main = expression(paste('Benchmarking ', italic('read.table'))))

The difference seems to be significant (please feel free to do some statistical tests to verify that), so we achieved a performance boost of more than 50 percent simply by fine-tuning the parameters of read.table.
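If you would like to formalize that claim, note that the object returned by microbenchmark is a data frame with expr and time columns (the timings are stored in nanoseconds), so a non-parametric comparison of the two timing distributions is straightforward; the Wilcoxon rank-sum test below is only one possible choice:

> # compare the raw timings of f() and g() without assuming normality
> wilcox.test(time ~ expr, data = res)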

Data files larger than the physical memory

Loading a larger amount of data into R from CSV files that would not fit in memory can be done with custom packages created for such cases. For example, both the sqldf package and the ff package have their own solutions to load data chunk by chunk into a custom data format. The former uses SQLite or another SQL-like database backend, while the latter creates a custom data frame of the ffdf class that can be stored on disk. The bigmemory package provides a similar approach. Usage examples (to be benchmarked later):

> library(sqldf)
> system.time(read.csv.sql('hflights.csv'))
   user  system elapsed 
  2.293   0.090   2.384 
> library(ff)
> system.time(read.csv.ffdf(file = 'hflights.csv'))
   user  system elapsed 
  1.854   0.073   1.918 
> library(bigmemory)
> system.time(read.big.matrix('hflights.csv', header = TRUE))
   user  system elapsed 
  1.547   0.010   1.559

Please note that the header argument defaults to FALSE in read.big.matrix from the bigmemory package, so be sure to read the manuals of the referenced functions before doing your own benchmarks. Some of these functions also support performance tuning, just like read.table. For further examples and use cases, please see the Large memory and out-of-memory data section of the High-Performance and Parallel Computing with R CRAN Task View at http://cran.r-project.org/web/views/HighPerformanceComputing.html.
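Also keep in mind that these functions do not all return a standard data frame, so downstream code may need adjusting; a quick (untimed) check of the returned classes could look like the following sketch:

> # sqldf returns a regular data.frame, while ff and bigmemory
> # return their own (partly on-disk) representations
> class(read.csv.sql('hflights.csv'))
> class(read.csv.ffdf(file = 'hflights.csv'))
> class(read.big.matrix('hflights.csv', header = TRUE))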