Book Image

R Data Analysis Cookbook - Second Edition

By : Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan
Book Image

R Data Analysis Cookbook - Second Edition

By: Kuntal Ganguly, Shanthi Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to the basic as well as advanced data analysis tasks. Right from acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how you can implement each technique in the best possible manner. You will also visualize your data using the popular R packages like ggplot2 and gain hidden insights from it. Starting with implementing the basic data analysis concepts like handling your data to creating basic plots, you will master the more advanced data analysis techniques like performing cluster analysis, and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, with ways to overcoming them in the easiest possible way. By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to test in real-world scenarios.
Table of Contents (14 chapters)

Correcting data

In practice, raw data is rarely tidy, and is much harder to work with as a result. It is often said that 80 percent of data analysis is spent on the process of cleaning and correcting the data.

In this recipe, you will learn the best way to correctly layout your data to serve two major purposes:

  • Making data suitable for software processing, whether that be mathematical functions, visualization, and others
  • Revealing information and insights

Getting ready

Download the files for this chapter and store the USArrests.csv file in your R working directory. You should also install the tidyr package using the following command.

> install.packages("tidyr")
> library(tidyr)
> crimeData <- read.csv("USArrests.csv",stringsAsFactors = FALSE)

How to do it...

Follow these steps to correct the data from your dataset.

  1. View some records of the dataset:
> View(crimeData)
  1. Add the column name state in the dataset:
> crimeData <- cbind(state = rownames(crimeData), crimeData)
  1. Gather all the variables between Murder and UrbanPop:
> crimeData.1 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder:UrbanPop)

> crimeData.1
  1. Gather all the columns except the column state:
> crimeData.2 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
-state)

> crimeData.2
  1. Gather only the Murder and Assault columns:
> crimeData.3 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder, Assault)

> crimeData.3
  1. Spread crimeData.2 to turn a pair of key:value (crime_typ:arrest_estimate) columns into a set of tidy columns
> crimeData.4 <- spread(crimeData.2,
key = "crime_type",
value = "arrest_estimate"
)
> crimeData.4

How it works...

Correct data format is crucial for facilitating the tasks of data analysis, including data manipulation, modeling, and visualization. The tidy data arranges values so that the relationships in the data parallel the structure of the data frame. Every tidy dataset is based on two basic principles:

  • Each variable is saved in its own column
  • Each observation is saved in its own row

In the crimeData dataframe, the row names were states, hence we used the function cbind() to add a column named state in the dataframe. The function gather() collapses multiple columns into key-value pairs. It makes wide data longer. The gather() function basically takes four arguments, data (dataframe), key (column name representing new variable), value (column name representing variable values), and names of the columns to gather (or not gather).

In the crimeData.2 data, all column names (except state) were collapsed into a single key column crime_type and their values were put into a value column arrest_estimate.

And, in the crimeData.3 data, the two columns Murder and Assault were collapsed and the remaining columns (state, UrbanPop, and Rape) were
duplicated.

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads them into multiple columns. It makes long data wider. The spread() function takes three arguments in general, data (dataframe), key (column values to convert to multiple columns), and value (single column value to convert to multiple columns' values).

There's more...

Beside the spread() and gather() functions, there are two more important functions in the tidyr package that help to make data tidy.

Combining multiple columns to single columns

The unite() function takes multiple columns and pastes them together into one column:

> crimeData.5 <- unite(crimeData,
col = "Murder_Assault",
Murder, Assault,
sep = "_")
> crimeData.5

We combine the columns Murder and Assault from the crimeData
data-frame to generate a new column Murder_Assault, having the values separated by _.

Splitting single column to multiple columns

The separate() function is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns:

> crimeData.6 <- separate_(crimeData.5,
col = "Murder_Assault",
into = c("Murder", "Assault"),
sep = "_")
>crimeData.6