Regression Analysis with R

By: Giuseppe Ciaburro

Overview of this book

Regression analysis is a statistical process for estimating and predicting relationships between variables. The predictions rest on the causal effect of one variable upon another. Regression techniques for modeling and analysis are applied to large datasets in order to reveal hidden relationships among the variables. This book explains what regression analysis is and walks you through the process from scratch. The first few chapters describe the different types of learning, supervised and unsupervised, and how they differ from each other. We then cover supervised learning in detail, looking at the various aspects of regression analysis. The chapters are arranged to follow the steps of a data science process: loading the training dataset, handling missing values, exploratory data analysis (EDA), transformations and feature engineering, model building, assessing model fit and performance, and finally making predictions on unseen datasets. Each chapter starts with the theoretical concepts and, once the reader is comfortable with the theory, moves to practical examples that support the understanding. The practical examples are illustrated using R code and several R packages, such as stats and caret. By the end of this book you will know the concepts and pain points related to regression analysis, and you will be able to apply what you have learned in your own projects.

R packages for regression


Previously, we mentioned R packages, which give us access to collections of functions for solving specific problems. In this section, we present some packages that contain valuable resources for regression analysis. These packages will be analyzed in detail in the following chapters, where we will provide practical applications.

The R stats package

R stats is a package that contains many useful functions for statistical calculations and random number generation. In the following table you will see some of the information on this package:

| Package | stats |
| --- | --- |
| Date | October 3, 2017 |
| Version | 3.5.0 |
| Title | The R stats package |
| Author | R core team and contributors worldwide |

The package contains a great many functions; we will mention only those most closely related to regression analysis:

  • lm: This function is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance, and analysis of co-variance.
  • summary.lm: This function returns a summary for linear model fits.
  • coef: This function extracts coefficients from objects returned by modeling functions. coefficients is an alias for it.
  • fitted: This function extracts fitted values from objects returned by modeling functions. fitted.values is an alias for it.
  • formula: This function provides a way of extracting formulae which have been included in other objects.
  • predict: This function predicts values based on linear model objects.
  • residuals: This function extracts model residuals from objects returned by modeling functions.
  • confint: This function computes confidence intervals for one or more parameters in a fitted model. There is a default method and a method for objects inheriting from the lm class.
  • deviance: This function returns the deviance of a fitted model object.
  • influence.measures: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (GLM).
  • lm.influence: This function provides the basic quantities used when forming a wide variety of diagnostics for checking the quality of regression fits.
  • ls.diag: This function computes basic statistics, including standard errors, t-values, and p-values for the regression coefficients of a least squares fit produced by lsfit.
  • glm: This function is used to fit GLMs, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
  • loess: This function fits a polynomial surface determined by one or more numerical predictors, using local fitting.
  • loess.control: This function sets control parameters for loess fits.
  • predict.loess: This function extracts predictions from a loess fit, optionally with standard errors.
  • scatter.smooth: This function plots and adds a smooth curve computed by loess to a scatter plot.

These are just some of the many functions contained in the stats package. With the resources offered by this package we can build linear regression models (including multiple and polynomial regression) as well as GLMs (such as logistic regression). We can also run model diagnostics to check the plausibility of the classical assumptions underlying the regression model, and we can fit local regression models, a non-parametric approach that fits regressions in local neighborhoods of the data.
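As a minimal sketch of how several of the stats functions listed above fit together, the following example uses the cars and mtcars datasets that ship with R. The choice of datasets, variables, and object names is an assumption made for illustration only; it is not taken from the book's own examples.

```r
# Fit a simple linear model: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)

summary(fit)                 # coefficient table, R-squared, F-statistic (summary.lm)
coef(fit)                    # estimated intercept and slope
confint(fit, level = 0.95)   # confidence intervals for the coefficients
formula(fit)                 # the formula stored in the fitted object
deviance(fit)                # for an lm fit, the residual sum of squares

# Fitted values plus residuals reproduce the observed response
all.equal(unname(fitted(fit) + residuals(fit)), cars$dist)

# Predictions at new speeds, with prediction intervals
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(fit, newdata = new_speeds, interval = "prediction")

# Leave-one-out deletion diagnostics (Cook's distance, hat values, and so on)
infl <- influence.measures(fit)

# Logistic regression as a GLM: transmission type as a function of weight
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Local regression, and its predictions with standard errors
lo_fit <- loess(dist ~ speed, data = cars)
lo_pred <- predict(lo_fit, newdata = new_speeds, se = TRUE)

# Scatter plot with a loess smooth curve added
scatter.smooth(cars$speed, cars$dist)
```

Calling predict on the loess object dispatches to predict.loess, and calling summary on the lm object dispatches to summary.lm, so the method names in the list rarely need to be typed in full.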

The car package

This package includes many functions for ANOVA, matrix and vector transformations, printing readable tables of coefficients from several regression models, creating residual plots, testing for autocorrelation of error terms, and many other general-interest statistical and graphing tasks.

In the following table you will see some of the information on this package:

| Package | car |
| --- | --- |
| Date | June 25, 2017 |
| Version | 2.1-5 |
| Title | Companion to Applied Regression |
| Author | John Fox, Sanford Weisberg, and many others |

The following are the most useful functions used in regression analysis contained in this package:

  • Anova: This function returns ANOVA tables for linear models and GLMs
  • linearHypothesis (linear.hypothesis in older versions of car): This function tests a linear hypothesis; it has methods for linear models, GLMs, multivariate linear models, and linear and generalized linear mixed-effects models
  • cookd: This function returns Cook's distances for linear models and GLMs; in recent versions of car it has been superseded by cooks.distance from the stats package
  • outlierTest (formerly outlier.test): This function reports the Bonferroni p-values for studentized residuals in linear models and GLMs, based on a t-test for linear models and a normal-distribution test for GLMs
  • durbinWatsonTest (formerly durbin.watson): This function computes residual autocorrelations and generalized Durbin-Watson statistics and their bootstrapped p-values
  • leveneTest (formerly levene.test): This function computes Levene's test for the homogeneity of variance across groups
  • ncvTest (formerly ncv.test): This function computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors

These are just some of the many functions contained in the car package. The package also provides many functions for drawing explanatory graphs from information extracted from regression models, as well as a series of functions for transforming variables.
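The diagnostics listed above can be sketched as follows. This example assumes that the car package (version 2.0 or later, which uses the camelCase function names) is installed; the mtcars model and the object names are illustrative choices, not the book's.

```r
library(car)  # assumption: car >= 2.0 is installed

# A small multiple regression on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# ANOVA table for the fitted model (Type II tests by default)
Anova(fit)

# Joint test of the linear hypothesis that both slopes are zero
lh <- linearHypothesis(fit, c("wt = 0", "hp = 0"))

# Cook's distances; cooks.distance lives in the stats package
cooks <- cooks.distance(fit)

# Bonferroni-adjusted test for the largest studentized residual
ot <- outlierTest(fit)

# Durbin-Watson test for autocorrelation of the residuals
dw <- durbinWatsonTest(fit)

# Score test for non-constant error variance
ncv <- ncvTest(fit)

# Levene's test for equal variance of mpg across cylinder groups
lev <- leveneTest(mpg ~ factor(cyl), data = mtcars)
```

Each call returns an object that prints a short test summary, so the results can be inspected simply by typing the object name at the console.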

The MASS package

This package includes many useful functions and example datasets, including functions for estimating linear models through generalized least squares (GLS), fitting negative binomial GLMs, the robust fitting of linear models, and Kruskal's non-metric multidimensional scaling.

In the following table you will see some of the information on this package:

| Package | MASS |
| --- | --- |
| Date | October 2, 2017 |
| Version | 7.3-47 |
| Title | Support Functions and Datasets for Venables and Ripley's MASS |
| Author | Brian Ripley, Bill Venables, and many others |

The following are the most useful functions used in regression analysis contained in this package:

  • lm.gls: This function fits linear models by GLS
  • lm.ridge: This function fits a linear model by ridge regression
  • glm.nb: This function is a modification of the system function glm(); it includes an estimation of the additional parameter, theta, to give a negative binomial GLM
  • polr: This function fits a logistic or probit regression model to an ordered factor response
  • lqs: This function fits a regression to the good points in the dataset, thereby achieving a regression estimator with a high breakdown point
  • rlm: This function fits a linear model by robust regression using an M-estimator
  • glmmPQL: This function fits a generalized linear mixed model (GLMM) with multivariate normal random effects, using penalized quasi-likelihood (PQL)
  • boxcox: This function computes and optionally plots profile log-likelihoods for the parameter of the Box-Cox power transformation for linear models

As we have seen, this package contains many functions that are useful in regression analysis; it also includes numerous datasets that we will use in the examples in the following chapters.
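To make the list above more concrete, here is a brief sketch that calls several of the MASS functions on datasets bundled with R and with MASS itself (mtcars, stackloss, cars, quine, and housing). The particular models and object names are assumptions chosen for illustration.

```r
library(MASS)  # distributed with R as a recommended package

# Ridge regression over a grid of penalty values lambda
ridge_fit <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars,
                      lambda = seq(0, 10, by = 0.5))

# Robust regression with an M-estimator
robust_fit <- rlm(stack.loss ~ ., data = stackloss)

# Negative binomial GLM; theta is estimated along with the coefficients
nb_fit <- glm.nb(Days ~ Eth + Sex + Age, data = quine)
nb_fit$theta

# Box-Cox profile log-likelihood for a linear model (response must be positive)
ols_fit <- lm(dist ~ speed, data = cars)
bc <- boxcox(ols_fit, plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]  # grid value with the highest likelihood

# Proportional-odds logistic regression for an ordered factor response
polr_fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
```

The value stored in best_lambda is only the best point on the default grid; a finer grid can be requested through the lambda argument of boxcox.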

The caret package

This package contains many functions to streamline the model training process for complex regression and classification problems. The package utilizes a number of R packages.

The following table lists some information on this package:

| Package | caret |
| --- | --- |
| Date | September 7, 2017 |
| Version | 6.0-77 |
| Title | Classification and Regression Training |
| Author | Max Kuhn and many others |

The most useful functions used in regression analysis in this package are as follows:

  • train: This function fits predictive models over different tuning parameters. It sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling-based performance measure.
  • trainControl: This function sets the control parameters for train, such as the resampling method (for example, cross-validation) used to estimate model performance.
  • varImp: This function calculates variable importance for the objects produced by train and by method-specific methods.
  • defaultSummary: This function calculates performance across resamples. Given two numeric vectors of data, the root mean squared error (RMSE) and R-squared are calculated. For two factors, the overall agreement rate and Kappa are determined.
  • knnreg: This function performs K-Nearest Neighbor (KNN) regression that can return the average value for the neighbors.
  • plotObsVsPred: This function plots observed versus predicted results in regression and classification models.
  • predict.knnreg: This function extracts predictions from the KNN regression model.

The caret package provides a unified interface to hundreds of machine learning algorithms (including regression methods), and offers useful and convenient methods for data visualization, data resampling, model tuning, and model comparison, among other features.
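The following is a minimal sketch of the train and trainControl workflow described above, assuming the caret package is installed. The 5-fold cross-validation setting, the mtcars model, and the value k = 3 are arbitrary illustrative choices.

```r
library(caret)  # assumption: caret is installed

set.seed(42)  # resampling is random; fix the seed for reproducibility

# Resampling settings passed to train(): 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Linear regression whose performance is estimated by cross-validation
lm_cv <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
lm_cv$results   # resampled performance measures, including RMSE and Rsquared

# Variable importance for the trained model
varImp(lm_cv)

# KNN regression: each prediction is the average response of the k neighbors
knn_fit <- knnreg(mpg ~ wt + hp, data = mtcars, k = 3)
knn_pred <- predict(knn_fit, newdata = mtcars)

# Performance summary from observed and predicted values
perf <- defaultSummary(data.frame(obs = mtcars$mpg, pred = knn_pred))
```

Here the KNN model is evaluated on its own training data, which flatters the result; in practice it would be passed through train with the same trainControl settings to obtain a resampled estimate.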

The glmnet package

This package contains extremely efficient procedures for fitting the entire Lasso or ElasticNet regularization path for linear regression, logistic and multinomial regression models, Poisson regression, and the Cox model. Multiple-response Gaussian and grouped multinomial regression are two recent additions.

The following table lists some information on this package:

| Package | glmnet |
| --- | --- |
| Date | September 21, 2017 |
| Version | 2.0-13 |
| Title | Lasso and Elastic-Net Regularized Generalized Linear Models |
| Author | Jerome Friedman, Trevor Hastie, Noah Simon, Junyang Qian, and Rob Tibshirani |

The following are the most useful functions used in regression analysis contained in this package:

  • glmnet: This function fits a GLM via penalized maximum likelihood. The regularization path is computed for the Lasso or ElasticNet penalty at a grid of values for the regularization parameter lambda. It can handle all shapes of data, including very large sparse data matrices, and fits linear, logistic, multinomial, Poisson, and Cox regression models.
  • glmnet.control: This function views and/or changes the factory default parameters in glmnet.
  • predict.glmnet: This function predicts fitted values, logits, coefficients, and more from a fitted glmnet object.
  • print.glmnet: This function prints a summary of the glmnet path at each step along the path.
  • plot.glmnet: This function produces a coefficient profile plot of the coefficient paths for a fitted glmnet object.
  • deviance.glmnet: This function computes the deviance sequence from the glmnet object.

As we have mentioned, this package fits Lasso and ElasticNet model paths for linear, logistic, and multinomial regression using coordinate descent. The algorithm is extremely fast, and exploits sparsity in the input matrix where it exists. A variety of predictions can be made from the fitted models.
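A short sketch of fitting and querying a Lasso path is shown here, assuming the glmnet package is installed. The predictors taken from mtcars and the value lambda = 0.5 are arbitrary choices made for illustration.

```r
library(glmnet)  # assumption: glmnet is installed

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y <- mtcars$mpg

# alpha = 1 gives the Lasso penalty; values between 0 and 1 give ElasticNet
lasso_fit <- glmnet(x, y, family = "gaussian", alpha = 1)

print(lasso_fit)                   # one row per lambda value along the path
plot(lasso_fit, xvar = "lambda")   # coefficient profile plot
dev_seq <- deviance(lasso_fit)     # deviance at each lambda on the path

# Coefficients and predictions at one (illustrative) value of lambda
coef(lasso_fit, s = 0.5)
pred <- predict(lasso_fit, newx = x[1:5, ], s = 0.5)
```

The argument s selects the penalty value at which coefficients or predictions are extracted; when it falls between two values on the fitted path, glmnet interpolates.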

The sgd package

This package contains a fast and flexible set of tools for large scale estimation. It features many stochastic gradient methods, built-in models, visualization tools, automated hyperparameter tuning, model checking, interval estimation, and convergence diagnostics.

The following table lists some information on this package:

| Package | sgd |
| --- | --- |
| Date | January 5, 2016 |
| Version | 1.1 |
| Title | Stochastic Gradient Descent for Scalable Estimation |
| Author | Dustin Tran, Panos Toulis, Tian Lian, Ye Kuang, and Edoardo Airoldi |

The following are the most useful functions used in regression analysis contained in this package:

  • sgd: This function runs Stochastic Gradient Descent (SGD) in order to optimize the induced loss function given a model and data
  • print.sgd: This function prints objects of the sgd class
  • predict.sgd: This function forms predictions using the estimated model parameters from SGD
  • plot.sgd: This function plots objects of the sgd class

The BLR package

This package implements Bayesian linear regression, an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference.

The following table lists some information on this package:

| Package | BLR |
| --- | --- |
| Date | December 3, 2014 |
| Version | 1.4 |
| Title | Bayesian Linear Regression |
| Author | Gustavo de los Campos, Paulino Perez Rodriguez |


The following are the most useful functions used in regression analysis contained in this package:

  • BLR: This function fits parametric regression models using different types of shrinkage methods.
  • sets: This is a data object rather than a function: a vector (599x1) that assigns observations to ten disjoint sets; the assignment was generated at random. It is used to conduct a 10-fold cross-validation (CV).

The lars package

This package contains efficient procedures for fitting an entire Lasso sequence with the cost of a single least squares fit. Least angle regression and infinitesimal forward stagewise regression are related to the Lasso.

The following table lists some information on this package:

| Package | lars |
| --- | --- |
| Date | April 23, 2013 |
| Version | 1.2 |
| Title | Least Angle Regression, Lasso and Forward Stagewise |
| Author | Trevor Hastie and Brad Efron |


The following are the most useful functions used in regression analysis contained in this package:

  • lars: This function fits least angle regression and Lasso and infinitesimal forward stagewise regression models.
  • summary.lars: This function produces an ANOVA-type summary for a lars object.
  • plot.lars: This function produces a plot of a lars fit. The default is a complete coefficient path.
  • predict.lars: This function makes predictions or extracts coefficients from a fitted lars model.
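
The four functions above can be combined as in the following sketch, which assumes the lars package is installed (note the lowercase package name, since R is case-sensitive). The predictors taken from mtcars and the fraction s = 0.5 are illustrative assumptions.

```r
library(lars)  # assumption: lars is installed

# lars expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y <- mtcars$mpg

# Fit the full Lasso path; type can also be "lar" or "forward.stagewise"
lasso_path <- lars(x, y, type = "lasso")

summary(lasso_path)   # ANOVA-type summary for each step of the path
plot(lasso_path)      # complete coefficient path

# Coefficients halfway along the path, measured as a fraction of the L1 norm
coefs <- coef(lasso_path, s = 0.5, mode = "fraction")

# Predictions for the first five observations at the same point on the path
pred <- predict(lasso_path, newx = x[1:5, ], s = 0.5,
                type = "fit", mode = "fraction")$fit
```

With mode = "fraction", s = 0 corresponds to the null model and s = 1 to the full least squares fit, so intermediate values trace out increasingly complex models.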