R Data Analysis Cookbook, Second Edition

By : Kuntal Ganguly, Viswanathan, Viswa Viswanathan

Overview of this book

Data analytics with R has emerged as a very important focus for organizations of all kinds. R enables even those with only an intuitive grasp of the underlying concepts, without a deep mathematical background, to unleash powerful and detailed examinations of their data. This book will show you how you can put your data analysis skills in R to practical use, with recipes catering to basic as well as advanced data analysis tasks. From acquiring your data and preparing it for analysis to the more complex data analysis techniques, the book will show you how you can implement each technique in the best possible manner. You will also visualize your data using popular R packages such as ggplot2 and gain hidden insights from it. Starting with basic data analysis concepts, from handling your data to creating basic plots, you will master more advanced data analysis techniques such as performing cluster analysis and generating effective analysis reports and visualizations. Throughout the book, you will get to know the common problems and obstacles you might encounter while implementing each of the data analysis techniques in R, along with the easiest ways of overcoming them. By the end of this book, you will have all the knowledge you need to become an expert in data analysis with R, and put your skills to the test in real-world scenarios.

Normalizing or standardizing data in a data frame

Distance computations play a big role in many data analytics techniques. Variables with larger values tend to dominate distance computations, so you may want to use standardized (or z) values instead.
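To make the dominance effect concrete, here is a minimal sketch using a small made-up data frame (the values and the row labels a to d are illustrative assumptions, not taken from BostonHousing.csv). It compares Euclidean distances before and after standardizing:

```r
# Hypothetical toy data: TAX is in the hundreds, RM is in single digits
toy <- data.frame(TAX = c(200, 300, 200, 700),
                  RM  = c(4, 4, 8, 6),
                  row.names = c("a", "b", "c", "d"))

# Euclidean distances on the raw values
d.raw <- as.matrix(dist(toy))

# Euclidean distances on the standardized (z) values
d.z <- as.matrix(dist(scale(toy)))

# Raw: a and b differ only in TAX, a and c differ only in RM
d.raw["a", "b"]  # 100
d.raw["a", "c"]  # 4

# Standardized: the RM difference now counts for more than the TAX difference
d.z["a", "b"]    # about 0.42
d.z["a", "c"]    # about 2.09
```

On the raw values, row a looks far closer to c than to b simply because TAX is measured in larger numbers. After standardizing, the ordering reverses.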

Getting ready

Download the BostonHousing.csv data file and store it in your R environment's working directory. Then read the data:

> housing <- read.csv("BostonHousing.csv") 

How to do it...

To standardize all the variables in a data frame containing only numeric variables, use:

> housing.z <- scale(housing) 

You can use the scale() function only on data frames in which every variable is numeric. Otherwise, you will get an error.
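The following sketch illustrates that restriction on a small made-up data frame (the column names price, rooms, and town are hypothetical). It captures the error instead of stopping, then shows one possible workaround: selecting only the numeric columns before calling scale().

```r
# Hypothetical data frame with one non-numeric column
mixed <- data.frame(price = c(10, 20, 30),
                    rooms = c(2, 3, 4),
                    town  = c("A", "B", "C"))

# scale() fails on the mixed data frame; keep the error message rather than aborting
result <- tryCatch(scale(mixed), error = function(e) conditionMessage(e))
result

# One workaround: keep only the numeric columns, then standardize
numeric.cols <- sapply(mixed, is.numeric)
mixed.z <- scale(mixed[, numeric.cols, drop = FALSE])
colnames(mixed.z)
```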

How it works...

When invoked in the preceding example, the scale() function computes the standard z score for each value (ignoring NAs) of each variable. That is, from each value it subtracts the mean and divides the result by the standard deviation of the associated variable.
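As a quick check of that description, this sketch computes z scores by hand for a short made-up vector containing an NA and compares them with the output of scale(). The vector x is an illustrative assumption.

```r
# Hypothetical vector with one missing value
x <- c(2, 4, NA, 6, 8)

# Manual z scores: subtract the mean, divide by the standard deviation,
# ignoring the NA when computing both statistics
manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
auto <- as.numeric(scale(x))

# The two results agree, and the NA stays an NA
all.equal(manual, auto)

# scale() also records the mean and standard deviation it used as attributes
attr(scale(x), "scaled:center")
attr(scale(x), "scaled:scale")
```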

The scale() function takes two optional arguments, center and scale, whose default values are TRUE. The following table shows the effect of these arguments:

Argument | Effect
-------- | ------
center = TRUE, scale = TRUE | Default behavior described earlier
center = TRUE, scale = FALSE | From each value, subtract the mean of the concerned variable
center = FALSE, scale = TRUE | Divide each value by the root mean square of the associated variable, where root mean square is sqrt(sum(x^2)/(n-1))
center = FALSE, scale = FALSE | Return the original values unchanged
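The non-default combinations can be checked on a small made-up matrix. This is a sketch under the assumption of complete data (no NAs), and the column names small and large are hypothetical.

```r
# Hypothetical two-column matrix
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("small", "large")))

# center = TRUE, scale = FALSE: subtract the column means only
centered <- scale(x, center = TRUE, scale = FALSE)
colMeans(centered)

# center = FALSE, scale = TRUE: divide by the root mean square of each column
rms.scaled <- scale(x, center = FALSE, scale = TRUE)
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))
by.hand <- sweep(x, 2, rms, "/")

# center = FALSE, scale = FALSE: values come back unchanged
unchanged <- scale(x, center = FALSE, scale = FALSE)
```

Note that with center = FALSE the divisor is the root mean square, not the standard deviation, so the resulting columns do not generally have a standard deviation of 1.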

There's more...

When using distance-based techniques, you may need to rescale several variables. You may find it tedious to standardize one variable at a time.

Standardizing several variables simultaneously

If your data frame mixes numeric and non-numeric variables, or you want to standardize only some of the variables in a fully numeric data frame, you can handle each variable separately, which is cumbersome, or use a function such as the following to handle a subset of variables:

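scale.many <- function(dat, column.nos) {
  # Append a standardized copy of each selected column, named <original>.z
  nms <- names(dat)
  for (col in column.nos) {
    name <- paste(nms[col], ".z", sep = "")
    # as.numeric() drops the matrix attributes that scale() attaches
    dat[name] <- as.numeric(scale(dat[, col]))
  }
  cat("Scaled", length(column.nos), "variable(s)\n")
  dat
}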

With this function, you can now do things like:

> housing <- read.csv("BostonHousing.csv") 
> housing <- scale.many(housing, c(1,3,5:7))

This will add the z values for variables 1, 3, 5, 6, and 7, with .z appended to the original column names:

> names(housing) 

[1] "CRIM" "ZN" "INDUS" "CHAS" "NOX" "RM"
[7] "AGE" "DIS" "RAD" "TAX" "PTRATIO" "B"
[13] "LSTAT" "MEDV" "CRIM.z" "INDUS.z" "NOX.z" "RM.z"
[19] "AGE.z"
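If you prefer to select columns by name rather than by position, a loop-free variant is possible. The sketch below is an alternative to the recipe's function, not the authors' method. It uses R's built-in mtcars dataset as a stand-in for the housing data, and the chosen columns are an illustrative assumption.

```r
# Stand-in data: the built-in mtcars dataset, not BostonHousing.csv
cars <- mtcars

# Columns to standardize, chosen by name (illustrative choice)
cols <- c("mpg", "hp", "wt")
z.names <- paste0(cols, ".z")

# Standardize each selected column and append it with a .z suffix
cars[z.names] <- lapply(cars[cols], function(v) as.numeric(scale(v)))

# Sanity check: each new column should have mean 0 and standard deviation 1
round(colMeans(cars[z.names]), 10)
sapply(cars[z.names], sd)
```

Selecting by name keeps the call valid if columns are later reordered, at the cost of breaking if a column is renamed.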

See also

See the Rescaling a variable to [0,1] recipe in this chapter.
