Machine Learning with R Cookbook, Second Edition


By: Yu-Wei, Chiu (David Chiu)
Overview of this book

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and a much more challenging one. Machine Learning with R Cookbook, Second Edition uses a practical approach to teach you how to perform machine learning with R. Each chapter is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, you will be able to construct a predictive model by using a variety of machine learning packages. In this book, you will first learn to set up the R environment and use simple R commands to explore data. The next topic covers how to perform statistical analysis with machine learning analysis and how to assess the models you create; both are covered in detail later in the book. You'll also learn how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects. With Machine Learning with R Cookbook, machine learning has never been easier.

Manipulating data

This recipe discusses how to use the built-in R functions to manipulate data. Because data manipulation is the most time-consuming part of most analysis procedures, it pays to know how to apply these functions to your data.

Getting ready

Ensure you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to manipulate the data with R.

Subset the data using the bracket notation:

  1. Load the dataset iris into the R session:
        > data(iris)  
  1. To select values, you may use a bracket notation that designates the indices of the dataset. The first index is for the rows and the second for the columns:
        > iris[1,"Sepal.Length"]
        Output:
    
        [1] 5.1  
  1. You can also select multiple columns using c():
        > Sepal.iris = iris[, c("Sepal.Length", "Sepal.Width")]  
  1. You can then use str() to summarize and display the internal structure of Sepal.iris:
        > str(Sepal.iris)
        Output:
        'data.frame':  150 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  1. To subset specific rows, give their indices as the first index in the bracket notation. This example selects the first five records, keeping only the Sepal.Length and Sepal.Width columns:
        > Five.Sepal.iris = iris[1:5, c("Sepal.Length", "Sepal.Width")]
        > str(Five.Sepal.iris)
        Output:
        'data.frame':   5 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 
  1. You can also set conditions to filter the data. For example, you can return only the setosa records with all five variables. In the following example, the first index specifies the row-selection criterion, and the second index specifies the range of column indices to return:
        > setosa.data = iris[iris$Species=="setosa",1:5]
        > str(setosa.data)
        Output:
        'data.frame':   50 obs. of  5 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
        $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
        $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
        $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Alternatively, the which function returns the indices of the elements that satisfy a condition. The following example returns the indices of the iris rows whose species equals setosa:
        > which(iris$Species=="setosa")
        Output:
        [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
        [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
        [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
  1. The returned indices can then be used as the row index to select the setosa records from iris. The following example returns the setosa records with all five variables:
        > setosa.data = iris[which(iris$Species=="setosa"),1:5]
        > str(setosa.data)
        Output:
        'data.frame':   50 obs. of  5 variables:
         $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
         $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
         $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
         $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
         $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
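
The logical-index and which approaches in the steps above give the same result on iris because Species has no missing values. The following is a minimal sketch of how they can differ when the filter column contains NA; the toy data frame df and its columns id and grp are hypothetical and not part of the recipe:

```r
# Hypothetical toy data frame with one missing value in the filter column
df <- data.frame(id = 1:4, grp = c("a", NA, "a", "b"))

# Logical indexing: the NA comparison yields an extra row filled with NA
by_logical <- df[df$grp == "a", ]

# which() keeps only positions where the comparison is TRUE, so NA is dropped
by_which <- df[which(df$grp == "a"), ]

print(by_logical)
print(by_which)
```

If the data may contain missing values, wrapping the condition in which is one way to avoid the all-NA rows.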

Subset data using the subset function:

  1. Besides the bracket notation, R provides a subset function that lets you select observations from a data frame with a logical condition.
  2. First, subset the sepal length and sepal width columns out of the iris data. To do so, specify the columns to keep in the select argument:
        > Sepal.data = subset(iris, select=c("Sepal.Length", "Sepal.Width"))
        > str(Sepal.data)
        Output:
        'data.frame': 150 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

This reveals that Sepal.data contains 150 observations of two variables, Sepal.Length and Sepal.Width.
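
The select argument of subset also accepts unquoted column names, ranges of adjacent columns, and negative selections. The following sketch shows these variants; the object names by.name, by.range, and no.species are chosen here for illustration only:

```r
data(iris)

# Quoted names, as in the recipe
by.name <- subset(iris, select = c("Sepal.Length", "Sepal.Width"))

# A range of adjacent columns, written with unquoted names
by.range <- subset(iris, select = Sepal.Length:Sepal.Width)

# A negative selection keeps every column except the one named
no.species <- subset(iris, select = -Species)

str(by.range)
names(no.species)
```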

  3. You can also use the subset argument to keep only the setosa records. The second argument of the subset function specifies the subset criteria:
        > setosa.data = subset(iris, Species =="setosa")
        > str(setosa.data)
        Output:
       'data.frame': 50 obs. of  5 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
        $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
        $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
        $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  4. Often you will want to combine conditions when subsetting data, taking either their union or their intersection. The OR (|) and AND (&) operators serve this purpose. For example, to retrieve data with Petal.Width >= 0.2 and Petal.Length <= 1.4:
        > example.data = subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2, select=Species)
        > str(example.data)
        Output:
        'data.frame': 21 obs. of  1 variable:
        $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
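
As a rough comparison of the two subsetting styles, the sketch below repeats the AND condition from the last step with bracket notation and adds an OR variant. The object names and.data, and.bracket, and or.data are illustrative; drop = FALSE is used so that the single selected column stays a data frame rather than collapsing to a vector:

```r
data(iris)

# AND: both conditions must hold (same criteria as the recipe)
and.data <- subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2,
                   select = Species)

# Equivalent bracket notation; drop = FALSE keeps a one-column data frame
and.bracket <- iris[iris$Petal.Length <= 1.4 & iris$Petal.Width >= 0.2,
                    "Species", drop = FALSE]

# OR: a row is kept when either condition holds
or.data <- subset(iris, Petal.Length <= 1.4 | Petal.Width >= 0.2,
                  select = Species)

nrow(and.data)
nrow(or.data)
```

The OR result can never have fewer rows than the AND result, since every row that satisfies both conditions also satisfies at least one.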
  • Merging data: Merging joins two data frames into one by a common column or by row names. The following example merges the flower.type data frame with the first three rows of iris, matching on the values of the shared Species column:
        > flower.type = data.frame(Species = "setosa", Flower = "iris")
        > merge(flower.type, iris[1:3,], by ="Species")
        Output:
        Species Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
      1  setosa   iris          5.1         3.5          1.4         0.2
      2  setosa   iris          4.9         3.0          1.4         0.2
      3  setosa   iris          4.7         3.2          1.3         0.2
  • Ordering data: The order function returns the row indices that sort a data frame by a specified column. The following example shows the first six records of the iris data ordered by sepal length, from largest to smallest:
        > head(iris[order(iris$Sepal.Length, decreasing = TRUE),])
        Output:
          Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
        132          7.9         3.8          6.4         2.0 virginica
        118          7.7         3.8          6.7         2.2 virginica
        119          7.7         2.6          6.9         2.3 virginica
        123          7.7         2.8          6.7         2.0 virginica
        136          7.7         3.0          6.1         2.3 virginica
        106          7.6         3.0          6.6         2.1 virginica
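
The ordering example above sorts on one column, so tied values (such as the four 7.7 records) simply keep their original relative order. order also accepts several keys. The following sketch sorts by Species first and then by Sepal.Length in descending order within each species; negating a numeric column is one common way to reverse its order, and the name sorted is illustrative:

```r
data(iris)

# Primary key: Species (ascending factor levels)
# Secondary key: Sepal.Length, negated so that larger values come first
sorted <- iris[order(iris$Species, -iris$Sepal.Length), ]

head(sorted, 3)
```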
    
  

How it works...

Before conducting data analysis, it is important to organize collected data into a structured format. Once the data is in an R data frame, you can subset, merge, and order it with built-in functions. This recipe first introduces two methods to subset data: one uses the bracket notation, while the other uses the subset function. Both methods generate a subset by selecting columns and filtering rows with the given criteria. The recipe then introduces the merge function to merge data frames. Last, the recipe shows how to use order to sort the data.
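
By default, merge keeps only the rows whose key appears in both data frames, which is why the merge in this recipe returns setosa rows only. The sketch below contrasts that default with all.y = TRUE, which also keeps unmatched rows from the second data frame. The three.species selection (rows 1, 51, and 101 of iris, one per species) and the names inner and outer are chosen here for illustration:

```r
data(iris)

flower.type <- data.frame(Species = "setosa", Flower = "iris")

# One row from each species
three.species <- iris[c(1, 51, 101), ]

# Default merge: only keys present in both data frames survive
inner <- merge(flower.type, three.species, by = "Species")

# all.y = TRUE keeps unmatched rows of the second data frame,
# filling the missing Flower values with NA
outer <- merge(flower.type, three.species, by = "Species", all.y = TRUE)

print(inner)
print(outer)
```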

There's more...

The sub and gsub functions use regular expressions to substitute text in a string. sub replaces only the first match in each string, while gsub replaces all matches:

> sub("e", "q", names(iris))
Output:
[1] "Sqpal.Length" "Sqpal.Width"  "Pqtal.Length" "Pqtal.Width"  "Spqcies"     
> gsub("e", "q", names(iris))
Output:
[1] "Sqpal.Lqngth" "Sqpal.Width"  "Pqtal.Lqngth" "Pqtal.Width"  "Spqciqs"
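
Because the pattern is a regular expression, metacharacters need care. As a small sketch, replacing the dots in the iris column names requires either escaping the dot or passing fixed = TRUE so the pattern is treated literally; the variable names escaped and literal are illustrative:

```r
data(iris)

# "." matches any character in a regular expression, so escape it
escaped <- gsub("\\.", "_", names(iris))

# Alternatively, fixed = TRUE treats the pattern as a literal string
literal <- gsub(".", "_", names(iris), fixed = TRUE)

print(escaped)
print(literal)
```

Both calls give "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", and "Species".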