Machine Learning with R Cookbook, Second Edition

By: Yu-Wei, Chiu (David Chiu)

Overview of this book

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and a much more challenging one. Machine Learning with R Cookbook, Second Edition uses a practical approach to teach you how to perform machine learning with R. Each chapter is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, you will be able to construct a predictive model by using a variety of machine learning packages. In this book, you will first learn to set up the R environment and use simple R commands to explore data. You will then learn how to perform statistical analysis, apply machine learning methods, and assess the models you create, all of which are covered in detail later in the book. You'll also learn how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects. With Machine Learning with R Cookbook, machine learning has never been easier.

Applying basic statistics


R provides a wide range of statistical functions, allowing users to obtain the summary statistics of data, generate frequency and contingency tables, produce correlations, and conduct statistical inferences. This recipe covers basic statistics that can be applied to a dataset.
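
The recipe's steps do not show the frequency and contingency tables mentioned above. As a minimal sketch, assuming the built-in iris dataset and an arbitrary choice of 1 cm bins for sepal length (the bin edges and variable names are illustrative, not part of the recipe), the table function can count observations per species and cross-tabulate species against binned sepal length:

```r
data(iris)

# Frequency table: number of observations in each species.
species_counts <- table(iris$Species)

# Contingency table: species against sepal length cut into 1 cm bins.
# The break points are an illustrative assumption; they span the
# observed range of 4.3 to 7.9.
length_bins <- cut(iris$Sepal.Length, breaks = c(4, 5, 6, 7, 8))
species_by_length <- table(iris$Species, length_bins)

print(species_counts)
print(species_by_length)
```

Each cell of the second table counts the flowers of one species whose sepal length falls into one bin, so the cells sum to the 150 rows of the dataset.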

Getting ready

Ensure that you have completed the previous recipes and installed R on your operating system.

How to do it...

Perform the following steps to apply statistics to a dataset:

  1. Load the iris data into an R session:
        > data(iris)
  1. Observe the format of the data:
        > class(iris)
        [1] "data.frame"  
  1. The iris dataset is a data frame containing four numeric attributes (sepal length, sepal width, petal length, and petal width) and one factor, Species. For numeric values, you can compute descriptive statistics with functions such as mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four numeric attributes in the dataset:
        > mean(iris$Sepal.Length)
        Output:
        [1] 5.843333
        > sd(iris$Sepal.Length)
        Output:
        [1] 0.8280661
        > var(iris$Sepal.Length)
        Output:
        [1] 0.6856935
        > min(iris$Sepal.Length)
        Output:
        [1] 4.3
        > max(iris$Sepal.Length)
        Output:
        [1] 7.9
        > median(iris$Sepal.Length)
        Output:
        [1] 5.8
        > range(iris$Sepal.Length)
        Output:
        [1] 4.3 7.9
        > quantile(iris$Sepal.Length)
        Output:
        0%  25%  50%  75% 100% 
        4.3  5.1  5.8  6.4  7.9
  1. The preceding example demonstrates how to apply descriptive statistics to a single variable. To obtain summary statistics for every numeric attribute of the data frame, you can use sapply. For example, the following call applies mean to the first four attributes of the iris data frame; setting na.rm to TRUE tells mean to ignore missing (NA) values:
        > sapply(iris[1:4], mean, na.rm=TRUE)
        Output:
        Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          5.843333     3.057333     3.758000     1.199333 
    
  
  1. As an alternative to using sapply to apply descriptive statistics to given attributes, R offers the summary function, which reports several descriptive statistics at once. In the following example, summary reports the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum of every numeric attribute in the iris dataset, along with the number of observations in each level of the Species factor:
        > summary(iris)
        Output:
          Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
         Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
         1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
         Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
         Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
         3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
         Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
  1. The preceding examples describe each variable on its own. To investigate the relationship between variables, R provides the cor function. The following example generates a 4x4 matrix by computing the correlation of each pair of numeric attributes in the iris dataset:
        > cor(iris[,1:4])
        Output:
                     Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
        Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
        Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
        Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
  1. R also provides the cov function to compute the covariance of each attribute pair within the iris dataset:
        > cov(iris[,1:4])
        Output:
                     Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
        Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
        Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
        Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
  1. Statistical tests are performed to assess the significance of results; here we demonstrate how to use a t-test to determine whether two samples differ statistically. In this example, we run t.test on the petal widths of irises from the setosa and versicolor species. If we obtain a p-value less than 0.05, we can reject the null hypothesis of equal means at the 5% significance level and conclude that the mean petal widths of setosa and versicolor differ significantly:
        > t.test(iris$Petal.Width[iris$Species=="setosa"], 
        +        iris$Petal.Width[iris$Species=="versicolor"])
        Output:
        
        Welch Two Sample t-test
        
        data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
        t = -34.0803, df = 74.755, p-value < 2.2e-16
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
         -1.143133 -1.016867
        sample estimates:
        mean of x mean of y 
            0.246     1.326 
  1. You can also perform a correlation test between the sepal length and the sepal width of an iris, and retrieve the correlation coefficient between the two variables. The stronger the positive correlation, the closer the value is to 1; the stronger the negative correlation, the closer the value is to -1. The test also reports a p-value for the null hypothesis that the true correlation is 0:
        > cor.test(iris$Sepal.Length, iris$Sepal.Width)
        Output:
        Pearson's product-moment correlation
        data:  iris$Sepal.Length and iris$Sepal.Width
        t = -1.4403, df = 148, p-value = 0.1519
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         -0.27269325  0.04351158
        sample estimates:
               cor 
        -0.1175698 
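
Both t.test and cor.test return an object whose components can be read directly, which avoids copying numbers from the printed report. The following is a minimal sketch, assuming a 0.05 significance level; the variable names (setosa, versicolor, tt, ct, alpha) are illustrative and not part of the recipe:

```r
data(iris)

# Split petal width by species, as in the t-test step.
setosa     <- iris$Petal.Width[iris$Species == "setosa"]
versicolor <- iris$Petal.Width[iris$Species == "versicolor"]

# Keep the returned "htest" objects instead of only printing them.
tt <- t.test(setosa, versicolor)                      # Welch two-sample t-test (R's default)
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)   # Pearson correlation test

alpha <- 0.05  # assumed significance level

tt$p.value < alpha    # TRUE: mean petal widths differ significantly
ct$p.value < alpha    # FALSE: p-value is 0.1519
unname(ct$estimate)   # correlation coefficient, about -0.1176
tt$conf.int           # 95 percent confidence interval for the difference in means
```

Read this way, the two results point in different directions: the difference in petal width between the species is significant, while the weak negative correlation between sepal length and sepal width is not significant at this level.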

How it works...

R has built-in statistical functions that enable the user to compute descriptive statistics on a single variable. The recipe first introduces how to apply mean, sd, var, min, max, median, range, and quantile to a single variable. To apply a statistic to all four numeric variables at once, you can use the sapply function, and summary reports several statistics for every variable in one call. To examine the relationships between variables, you can compute their correlation and covariance with cor and cov. Finally, the recipe shows how to test whether two given samples differ significantly with t.test, and whether two variables are significantly correlated with cor.test.
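
Two of these ideas can be made concrete in a few lines. The sketch below is illustrative only (the names num, stats, sds, and cor_from_cov are assumptions): it uses sapply to compute several statistics per attribute in one call, and it shows how the correlation matrix is the covariance matrix rescaled by the standard deviations of each pair of variables:

```r
data(iris)
num <- iris[, 1:4]   # the four numeric attributes

# sapply with a function that returns a named vector yields a matrix:
# one row per statistic, one column per attribute.
stats <- sapply(num, function(x) c(mean = mean(x), sd = sd(x), median = median(x)))
print(stats)

# Correlation is covariance divided by the product of the two
# standard deviations: cor(x, y) = cov(x, y) / (sd(x) * sd(y)).
sds <- sapply(num, sd)
cor_from_cov <- cov(num) / outer(sds, sds)

# Should agree with cor() up to floating-point tolerance.
print(all.equal(cor_from_cov, cor(num)))
```

Because of this rescaling, correlation values always lie between -1 and 1 and are comparable across attribute pairs, whereas covariance values depend on the units and spread of each attribute.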

There's more...

If you need to compute summary statistics for data in different groups, you can use the aggregate function, or the melt and cast functions from the reshape package, to summarize subsets of the data:

  1. Use aggregate to calculate the mean of each iris attribute, grouped by species:
        > aggregate(x=iris[,1:4], by=list(iris$Species), FUN=mean)
  1. Use the reshape package to calculate the mean of each iris attribute for the setosa and versicolor species, with a grand_row margin that adds a summary row across the selected species:
        > library(reshape)
        > iris.melt <- melt(iris, id='Species')
        > cast(Species~variable, data=iris.melt, mean,
        +      subset=Species %in% c('setosa','versicolor'),
        +      margins='grand_row')
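
For grouped means, base R alone may be enough. As a sketch that needs no extra package (the names by_species and petal_width_means are illustrative), the formula interface of aggregate and the tapply function produce the same per-species means:

```r
data(iris)

# Formula interface: "." stands for every column other than Species,
# so this computes the mean of all four attributes per species.
by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
print(by_species)

# tapply summarizes a single attribute per group and returns a named vector.
petal_width_means <- tapply(iris$Petal.Width, iris$Species, mean)
print(petal_width_means)
```

Compared with the by=list(...) form, the formula interface keeps Species as the name of the grouping column in the result instead of the generic Group.1.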

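For information on aggregate, refer to the help document by using ?aggregate. For the reshape package, load it and use ?melt or ?cast; note that ?reshape on its own opens the help page for the base R reshape function, which is a different tool.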