The dataset for analysis here is DNA, pulled from the mlbench package. You don't have to install the package, as I've put the data in a CSV file and placed it on GitHub: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/dna.csv.
Install the packages as needed and load the data:
> library(magrittr)
> install.packages("earth")
> install.packages("glmnet")
> install.packages("mlr")
> install.packages("randomForest")
> install.packages("tidyverse")
> dna <- read.csv("dna.csv")
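If some of these packages are already on your machine, you may prefer to check before reinstalling. The following is a minimal sketch of one way to do that, not part of the original code: the object names `pkgs`, `is_installed`, and `missing_pkgs` are illustrative, and the actual install call is left commented out so that the snippet only reports what is missing:

```r
# Packages used in this section (magrittr is loaded with library() above)
pkgs <- c("magrittr", "earth", "glmnet", "mlr", "randomForest", "tidyverse")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not available yet
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install the missing ones
} else {
  message("All packages are already installed.")
}
```

Checking with `requireNamespace()` rather than `library()` avoids attaching every package to the search path just to test for its presence.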
The data consists of 3,186 observations, 180 input features coded as binary indicators, and the Class response. The response is a factor with three labels indicating a DNA type: ei, ie, or neither, coded as n. The following is a table of the target labels:
> table(dna$Class)

  ei   ie    n
 767  765 1654
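The counts in this table also give a quick read on class balance. As a minimal sketch, the snippet below re-enters the three counts by hand, so it runs without the CSV file, and converts them to proportions. The names `class_counts`, `total_obs`, and `class_props` are illustrative and do not come from the original code:

```r
# Class counts copied from the table(dna$Class) output above
class_counts <- c(ei = 767, ie = 765, n = 1654)

# Total number of observations implied by the table
total_obs <- sum(class_counts)

# Share of each label; the three values sum to 1
class_props <- class_counts / total_obs

print(total_obs)
print(round(class_props, 3))
```

With the data loaded, `prop.table(table(dna$Class))` should give the same proportions directly. Slightly more than half of the observations carry the n label, which is worth keeping in mind when judging classification accuracy later on.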
This data should be ready for analysis, but let's run some quick checks to verify, starting with missing values:
> na_count <- sapply(dna, function...