Mastering Machine Learning with R

By: Cory Lesmeister

Summary stats


We will now cover some basic measures of central tendency and dispersion, along with simple plots. The first question to address is how R handles missing values in calculations. To see what happens, create a vector with a missing value (NA in the R language), then sum its values with sum():

> a = c(1,2,3,NA)

> sum(a)
[1] NA

Unlike SAS, which would sum the non-missing values, R returns NA whenever at least one value is missing. We could create a new vector with the missing value deleted, but it is simpler to tell sum() to exclude missing values with na.rm=TRUE:

> sum(a, na.rm=TRUE)
[1] 6
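
The alternative mentioned above, building a new vector without the missing value, can be sketched with is.na(). This is only a minimal illustration; the name a_complete is ours and does not appear in the book:

```r
# Vector with one missing value, as in the example above
a = c(1, 2, 3, NA)

# is.na() flags each element that is missing
is.na(a)

# Keep only the non-missing elements (a_complete is an illustrative name)
a_complete = a[!is.na(a)]
sum(a_complete)

# Other summary functions, such as mean(), accept na.rm in the same way
mean(a, na.rm = TRUE)
```

Passing na.rm=TRUE is usually more convenient because the original vector stays intact.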

Functions exist to identify the measures of central tendency and dispersion of a vector:

> data = c(4,3,2,5.5,7.8,9,14,20)

> mean(data)
[1] 8.1625

> median(data)
[1] 6.65

> sd(data)
[1] 6.142112

> max(data)
[1] 20

> min(data)
[1] 2

> range(data)
[1]  2 20

> quantile(data)
   0%   25%   50%   75%  100% 
 2.00  3.75  6.65 10.25 20.00 
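
It may help to see what sd() computes. It returns the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. The following sketch reproduces it by hand; the names n and manual_sd are illustrative:

```r
data = c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

n = length(data)

# Sum of squared deviations from the mean, divided by n - 1
manual_sd = sqrt(sum((data - mean(data))^2) / (n - 1))

# Should agree with the built-in function
all.equal(manual_sd, sd(data))

# var() is the square of sd(); IQR() is the 75% quantile minus the 25% quantile
var(data)
IQR(data)
```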

The summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum in a single call:

> summary(data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   3.750   6.650   8.162  10.250  20.000
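
The result of summary() can also be stored and queried by the names shown in the printout. A short sketch, assuming the default labels printed above; the name s is illustrative:

```r
data = c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

# Store the summary; elements can be extracted by name
s = summary(data)
s[["Median"]]

# Interquartile range from the stored quartiles
s[["3rd Qu."]] - s[["1st Qu."]]

# summary() prints about four significant digits, so the mean appears as 8.162;
# mean() shows more digits
mean(data)
```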

We can use plots to visualize the data. The base plot here will be barplot(), and we will then use abline() to add horizontal lines at the mean and the median. As the default line is solid, we will draw the median as a dashed line with lty=2 to distinguish it from the mean:

> barplot(data)

> abline(h=mean(data))

> abline(h=median(data), lty=2)

The output of the preceding commands is as follows:

A number of functions are available to generate data from different distributions. Here, we use one such function, rnorm(), to draw 100 data points from a normal distribution with a mean of zero and a standard deviation of one. We will then plot the values and also plot a histogram. To reproduce the results, set the same random seed with set.seed() before generating the data:

> set.seed(1)

> norm = rnorm(100)
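
To illustrate why the seed matters, here is a minimal sketch that draws the sample more than once. The names first, second, and third are illustrative and are not used elsewhere in this section:

```r
# Same seed, same draws
set.seed(1)
first = rnorm(100)

set.seed(1)
second = rnorm(100)

identical(first, second)

# Without resetting the seed, the next draw is different
third = rnorm(100)
identical(first, third)

# The sample mean and sd should be near, but not exactly, 0 and 1
mean(first)
sd(first)
```

Resetting the seed immediately before each call to rnorm() is what makes the two draws match.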

The following command plots the 100 data points stored in norm:

> plot(norm)

The output of the preceding command is as follows:

Finally, produce a histogram with hist(norm):

> hist(norm)

The following is the output of the preceding command: