Statistical Application Development with R and Python - Second Edition

Overview of this book

Statistical analysis involves collecting and examining data to describe the nature of the data that needs to be analyzed. It helps you explore relationships in the data and build models to make better decisions. This book explores statistical concepts alongside R and Python, which are integrated from the very start. Almost every concept is accompanied by R code that illustrates the strengths of R and its applications, and the R programs are complemented by equivalent Python programs. You will first understand data characteristics, descriptive statistics, and the exploratory attitude, which give you a firm footing in data analysis. Statistical inference then completes the technical foundation of statistical methods. Linear regression, logistic regression, and CART build the essential toolkit that will help you tackle complex real-world problems. You will begin with a brief understanding of the nature of data and end with modern, advanced statistical models such as CART. Every step is taken with data and R code, and is further enhanced by Python. The data analysis journey begins with exploratory analysis, which is more than simple descriptive data summaries. You will then apply linear regression modeling, and end with logistic regression, CART, and spatial statistics. By the end of this book, you will be able to apply your statistical learning in major domains at work or in your projects.

Packages and settings – R and Python


In addition to RSADBE, we will need four R packages: ridge, DAAG, splines, and MASS. The required Python packages are matplotlib, pandas, numpy, pylab, statsmodels, and sklearn:

  1. First set the working directory in R:

    > setwd("MyPath/R/Chapter_06")

  2. Load the essential R packages:

    > library(RSADBE)
    > library(ridge)
    > library(DAAG)
    > library(splines)
    > library(MASS)

  3. Set the working directory and load the required packages and functions in Python:

Using these packages and functions, we will be able to carry out the computations required in the rest of the chapter.

The overfitting problem

The limitation of the linear regression model is best understood through an example. I have created a hypothetical dataset to illustrate the problem of overfitting. A scatterplot of the dataset is shown in the figure "A non-linear relationship displayed by scatter plot".

It appears from the scatterplot that, for x-values up to 6, there is a linear increase in y, and an eyeball estimate of the...
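
To make the idea of overfitting concrete, here is a minimal Python sketch. It does not use the author's dataset: the data are invented for illustration (linear growth up to x = 6, then a plateau, plus seeded noise), and the choice of polynomial degrees 1, 3, and 9 is an assumption. It fits polynomials of increasing degree and reports the residual sum of squares on the training points:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Invented data, NOT the book's dataset: y rises linearly up to x = 6,
# then levels off; a little seeded noise is added on top.
rng = np.random.default_rng(0)
x = np.arange(1, 11, dtype=float)
y = np.where(x <= 6, 2.0 * x, 12.0) + rng.normal(scale=0.5, size=x.size)

# Fit polynomials of increasing degree by least squares and record
# the residual sum of squares (RSS) on the same points used for fitting.
rss = {}
for degree in (1, 3, 9):
    fitted = Polynomial.fit(x, y, degree)
    residuals = y - fitted(x)
    rss[degree] = float(np.sum(residuals ** 2))

for degree, value in rss.items():
    print(f"degree {degree}: training RSS = {value:.6f}")
```

The training RSS cannot increase as the degree grows, because each lower-degree polynomial is a special case of the higher-degree one. With ten points, a degree-9 polynomial can pass through every observation, so a near-zero training error here reflects the model memorising the noise rather than capturing the underlying relationship.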