All about R packages

Packages in R are collections of functions and datasets developed by the community.

Installing packages

Although R's base installation already includes many functions, we need to install additional packages to add new functionality. For example, base R can visualize data using the plot function; nevertheless, we could install the ggplot2 package to obtain more polished plots.
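
For instance, a minimal illustration of the base plot function, using the built-in mtcars dataset:

# A scatter plot with base R's plot() function, using the built-in mtcars dataset
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")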

A package mainly includes R code (though not always only R code), documentation explaining the package and its functions, examples, and sometimes datasets.

Packages are hosted in different repositories, from which you can install them.

Two of the most popular repositories for R packages are as follows:

  • CRAN: The official repository, maintained by the R community around the world. All of the packages that are published on this repository should meet quality standards.
  • GitHub: This repository is not specific to R packages, but many R packages are developed as open source projects hosted there. Unlike CRAN, there is no review process when a package is published.

To install a package from CRAN, use the install.packages() command. For example, the ggplot2 package can be installed using the following command:

install.packages("ggplot2")

To install packages from repositories other than CRAN, I would recommend using the devtools package:

install.packages("devtools")

This package simplifies the process of installing packages from different repositories. It provides several functions, depending on the repository you want to download a package from.

For example, use install_cran() to download a package from CRAN or install_github() to download it from GitHub.
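
As a minimal sketch (assuming devtools is already installed, and using "tidyverse/ggplot2" purely as an example of the "user/repository" format expected by install_github()):

library(devtools)
# Install a package from CRAN
install_cran("ggplot2")
# Install the development version of a package hosted on GitHub
install_github("tidyverse/ggplot2")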

After a package has been downloaded and installed, we load it into our current R session using the library function. A package must be loaded before its functions can be used in the session:

library(ggplot2)
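
With ggplot2 loaded, its functions are now available. As a quick sketch, again using the built-in mtcars dataset:

# Scatter plot of fuel consumption against weight, drawn with ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()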

The require function can also be used to load a package. The difference between require and library lies in how they behave when the package is not found: library stops with an error, whereas require issues a warning, returns FALSE, and lets the code continue executing.
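
This behavior makes require useful for checking whether a package is available before using it. A common sketch of this pattern is as follows:

# require() returns FALSE (with a warning) when the package is missing,
# so we can install it only if it is not already available
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}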

Necessary packages

To run all of the code presented in this book, you need to install a number of packages. Specifically, you need the following packages (listed alphabetically); a sketch for installing them all in one step follows the list:

  • Amelia: Package for missing data visualization and imputation.
  • Boruta: Implements a feature selection algorithm for finding relevant variables.
  • caret: This package (short for classification and regression training) implements several machine learning algorithms for building predictive models.
  • caTools: Contains several basic utility functions, including predictive metrics and functions to split samples.
  • choroplethr/choroplethrMaps: Creates maps in R.
  • corrplot: Calculates correlation among variables and displays them graphically.
  • DataExplorer: Includes different functions for the data exploration process.
  • dplyr: Package for data manipulation.
  • fBasics: Includes techniques of explorative data analysis.
  • funModeling: Functions for data cleaning, variable importance analysis, and model performance.
  • ggfortify: Data visualization tools for statistical analysis results.
  • ggplot2: System for declaratively creating graphics.
  • glmnet: A package oriented toward Lasso and elastic-net regularized regression models.
  • googleVis: R interface to Google Charts.
  • h2o: A package that includes fast and scalable algorithms, including gradient boosting, random forest, and deep learning.
  • h2oEnsemble: Provides functionality to create ensembles from the base learning algorithms that are accessible via the h2o package.
  • Hmisc: Contains many functions that are useful for data analysis and also for importing files from different formats.
  • kohonen: Facilitates the creation and visualization of self-organizing maps.
  • lattice: A package to create powerful graphs.
  • lubridate: Incorporates functions to work with dates in an easy way.
  • MASS: Contains several statistical functions.
  • plotrix: Provides many plotting, labeling, axis, and color-scaling functions.
  • plyr: This contains tools that can split, apply, and combine data.
  • randomForest: Algorithms for random forests for classification and regression.
  • rattle: This provides a GUI for different R packages that can aid in data mining.
  • readr: Provides a fast and friendly way to read rectangular data from .csv, .tsv, and .fwf files.
  • readtext: Functions to import and handle plain and formatted text files.
  • recipes: A package for defining and applying data preprocessing and feature engineering steps for analysis and modeling.
  • rpart: Implements classification and regression trees.
  • rpart.plot: The easiest way to plot a tree that's created using the rpart package.
  • Rtsne: Implementation of t-distributed Stochastic Neighbor Embedding (t-SNE).
  • RWeka: Contains many data mining algorithms, along with tools for pre-processing and classifying data. It provides an easy interface to tasks such as regression, clustering, association rules, and visualization.
  • rworldmap: Enables mapping of country-level and gridded user datasets.
  • scales: Provides methods for automatically determining breaks and labels for axes and legends, and for mapping data to graphical aesthetics.
  • smbinning: A set of functions to build scoring models.
  • SnowballC: Implements the well-known Porter word stemming algorithm, which collapses words to a common root so that vocabularies can be compared.
  • sqldf: Functions to manipulate R data frames using SQL.
  • tibbletime: Useful functions to work with time series.
  • tidyquant: A package focused on retrieving, manipulating, and analyzing financial data in the easiest way possible.
  • tidyr: Includes functions for data frame manipulations.
  • tidyverse: A single package that installs a collection of packages for data manipulation, exploration, and visualization.
  • tm: Package for text mining in R.
  • VIM: Using this package, missing values can be visualized.
  • wbstats: This gives you access to data and statistics from the World Bank API.
  • WDI: Search, extract, and format data from the World Bank's World Development Indicators (WDI).
  • wordcloud: Provides powerful functions for creating word clouds, and can also help in visualizing the differences and similarities between documents.
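
All of these packages can be installed with a single call to install.packages() by passing their names as a character vector. The following sketch also skips packages that are already installed; note that h2oEnsemble is not distributed on CRAN, so it would be installed from GitHub with devtools, as described earlier:

# Packages required to run the code in this book (CRAN packages only)
pkgs <- c("Amelia", "Boruta", "caret", "caTools", "choroplethr",
          "choroplethrMaps", "corrplot", "DataExplorer", "dplyr", "fBasics",
          "funModeling", "ggfortify", "ggplot2", "glmnet", "googleVis",
          "h2o", "Hmisc", "kohonen", "lattice", "lubridate", "MASS",
          "plotrix", "plyr", "randomForest", "rattle", "readr", "readtext",
          "recipes", "rpart", "rpart.plot", "Rtsne", "RWeka", "rworldmap",
          "scales", "smbinning", "SnowballC", "sqldf", "tibbletime",
          "tidyquant", "tidyr", "tidyverse", "tm", "VIM", "wbstats", "WDI",
          "wordcloud")

# Install only the packages that are not yet present locally
missing_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)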

Once these packages have been installed, we can start working with all the code that's contained in the following chapters.