R for Data Science

By: Dan Toomey

Dataset


Machine learning works with a dataset that we split into a training section and a testing section. We use the training data to build our model. We can then validate that model against the remaining testing data.
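
As a rough illustration of that split, here is a minimal sketch. It uses R's built-in mtcars data frame so that it runs on its own, not the housing data. The 75/25 ratio, the seed, the variable names, and the choice of a simple linear model are all assumptions made for the example.

```r
# Illustrative sketch only: mtcars, the 75/25 ratio and the seed are assumptions.
set.seed(42)                                    # make the random split repeatable
n <- nrow(mtcars)                               # number of observations (32)

# Randomly pick 75% of the row numbers for the training section
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
training  <- mtcars[train_idx, ]                # rows used to build the model
testing   <- mtcars[-train_idx, ]               # remaining rows, held back

# Build the model on the training section only
model <- lm(mpg ~ wt, data = training)

# Test the model against the held-back section
predicted <- predict(model, newdata = testing)
rmse <- sqrt(mean((testing$mpg - predicted)^2)) # root mean squared error
```

Because the model never sees the testing rows while it is being fitted, the error measured on them gives a fairer idea of how it would behave on new data.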

The first step is finding a dataset with several variables and, ideally, several hundred observations. I am using the housing data from the UCI Machine Learning Repository (http://uci.edu). Let's load the dataset and name its columns using the following commands:

> housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
> colnames(housing) <- c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PRATIO","B","LSTAT","MDEV")

There are 506 observations of 14 variables. A summary gives a better idea of the data, as follows:

> summary(housing)
      CRIM                ZN             INDUS            CHAS        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu...