Statistical Application Development with R and Python - Second Edition

Overview of this book

Statistical analysis involves collecting and examining data to describe its nature. It helps you explore relationships in data and build models to make better decisions. This book explores statistical concepts alongside R and Python, which are integrated from the outset. Almost every concept is accompanied by R code that exemplifies the strength of R and its applications, and the R programs are complemented by equivalent Python programs. You will first understand data characteristics, descriptive statistics, and the exploratory attitude, which will give you a firm footing in data analysis. Statistical inference completes the technical footing of statistical methods. Linear regression, logistic regression, and CART build the essential toolkit that will help you solve complex real-world problems. You will begin with a brief understanding of the nature of data and end with modern and advanced statistical models such as CART. Every step is taken with data and R code, and further enhanced by Python. The data analysis journey begins with exploratory analysis, which is more than simple descriptive data summaries. You will then apply linear regression modeling, and end with logistic regression, CART, and spatial statistics. By the end of this book, you will be able to apply your statistical learning in major domains at work or in your projects.

Continuous distributions


The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over a continuous interval, and these are examples of continuous RVs. In the previous section, we dealt with RVs that had discrete output; in this section, we deal with RVs that have continuous output. One distinction from the previous section needs to be pointed out explicitly.

In the case of a discrete RV, the pmf assigns a positive probability to each value that the RV can take. In the continuous case, the RV takes any specific value with probability zero. The technical reasons for this are beyond the scope of this book. In the discrete case, the probabilities of individual values are specified by the pmf; in the continuous case, probabilities are assigned to intervals and are determined by the probability density function, abbreviated as pdf.

Suppose that we have a continuous RV X with the pdf f(x) defined over the possible x values; that is, we assume that the pdf f(x) is well defined over the range of the RV X, denoted by $R_X$. The integral of f(x) over this range must equal 1; that is, $\int_{R_X} f(x)\,dx = 1$. The probability that the RV X takes a value in an interval [a, b] is defined by:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

In general, we are interested in the cumulative probabilities of a continuous RV, that is, the probability of the event X ≤ x. In terms of the previous equations, this is obtained as:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(s)\,ds$$

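A special name for this probability is the cumulative distribution function (cdf). The mean and variance of a continuous RV are then defined by:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad Var(X) = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx$$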
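The definitions above can be checked numerically. The following Python sketch is only an illustration: the density f(x) = 2x on [0, 1] and the helper `integrate` are hypothetical choices made for this example, not something taken from the book. It approximates each integral with the midpoint rule, using only the standard library.

```python
# Hypothetical density for illustration only: f(x) = 2x on [0, 1], zero elsewhere.
def pdf(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(g, lower, upper, n=100_000):
    """Approximate the integral of g over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / n
    return width * sum(g(lower + (i + 0.5) * width) for i in range(n))


total = integrate(pdf, 0.0, 1.0)          # a valid pdf integrates to 1
prob = integrate(pdf, 0.2, 0.5)           # P(0.2 <= X <= 0.5)
cdf_at_half = integrate(pdf, 0.0, 0.5)    # F(0.5) = P(X <= 0.5)
mean = integrate(lambda x: x * pdf(x), 0.0, 1.0)
variance = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, 1.0)

print(total, prob, cdf_at_half, mean, variance)
```

For this particular density, the exact answers are 1, 0.21, 0.25, 2/3, and 1/18, so the printed values should agree with them to several decimal places.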

As in the previous section, we will begin with the simplest RV, the one that follows the uniform distribution.

Uniform distribution

An RV is said to have a uniform distribution over the interval $[0, \theta]$, $\theta > 0$, if its probability density function is given by:

$$f(x; \theta) = \frac{1}{\theta}, \quad 0 \le x \le \theta$$

In fact, it is not necessary to restrict our focus to the positive real line. For any two real numbers a and b with b > a, the uniform RV can be defined by:

$$f(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b$$

The uniform distribution has a very important role to play in simulation, as will be seen in Chapter 6, Simulation. As with the discrete counterpart, in the continuous case any two intervals of the same length within [a, b] have an equal probability of occurring. The mean and variance of a uniform RV over the interval [a, b] are respectively given by:

$$E(X) = \frac{a + b}{2}, \qquad Var(X) = \frac{(b - a)^2}{12}$$

Example 1.4.1. Horgan’s (2008), Example 15.3: The International Journal of Circuit Theory and Applications reported in 1990 that researchers at the University of California, Berkeley, had designed a switched capacitor circuit for generating random signals whose trajectory is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested in calculating the probability that the trajectory falls in the interval [0.35, 0.58]. Though the answer is straightforward, we will obtain it using the punif function:

> punif(0.58)-punif(0.35)
[1] 0.23
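For readers following along in Python, the same probability can be obtained without any third-party package. The sketch below is an assumption-laden stand-in rather than the book's own program: `uniform_cdf`, `uniform_mean`, and `uniform_variance` are hypothetical helper names, with `uniform_cdf` written to mirror what R's `punif(q, min, max)` returns.

```python
def uniform_cdf(q, a=0.0, b=1.0):
    """P(X <= q) for X uniform on [a, b]; mirrors R's punif(q, min, max)."""
    if b <= a:
        raise ValueError("b must be greater than a")
    if q <= a:
        return 0.0
    if q >= b:
        return 1.0
    return (q - a) / (b - a)


def uniform_mean(a, b):
    """Mean of a uniform RV on [a, b]."""
    return (a + b) / 2.0


def uniform_variance(a, b):
    """Variance of a uniform RV on [a, b]."""
    return (b - a) ** 2 / 12.0


# Probability that the trajectory falls in [0.35, 0.58] on the unit interval.
prob = uniform_cdf(0.58) - uniform_cdf(0.35)
print(round(prob, 2))  # 0.23
print(uniform_mean(0.0, 1.0), uniform_variance(0.0, 1.0))
```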

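Of course, we don’t need software for such simple integrals; nevertheless, the computation by hand is:

$$P(0.35 \le X \le 0.58) = \int_{0.35}^{0.58} 1\,dx = 0.58 - 0.35 = 0.23$$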

Exponential distribution

The exponential distribution is probably one of the most important probability distributions in statistics, and more so for computer scientists. The time between arrivals in a queuing system, the time between two incoming calls on a mobile, the lifetime of a laptop, and so on, are some of the important applications where this distribution has lasting utility. The pdf of an exponential RV is specified by:

$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0,\ \lambda > 0$$

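The parameter $\lambda$ of the exponential distribution is sometimes referred to as the failure rate. The exponential RV enjoys a special property called the memory-less property, which conveys that:

$$P(X \ge t + s \mid X \ge s) = P(X \ge t), \quad s, t \ge 0$$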

The mathematical statement translates into the property that if X is an exponential RV, then its failure in the future depends only on the present, and the past (age) of the RV does not matter. In simple words, this means that the failure rate is constant in time and does not depend on the age of the system. Let us obtain the plots of a few exponential distributions:

> par(mfrow=c(1,2))
> # Left panel: rates between 0.2 and 1
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,lty=1)
> # Right panel: rates between 10 and 50
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,lty=1)

The exponential densities
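The memory-less property discussed earlier, P(X > s + t | X > s) = P(X > t), can also be verified numerically. The Python sketch below is only illustrative: the helper `exp_survival` and the values chosen for `rate`, `s`, and `t` are assumptions made for this example and do not come from the book.

```python
import math


def exp_survival(x, rate):
    """P(X > x) for an exponential RV with the given rate, for x >= 0."""
    return math.exp(-rate * x)


rate, s, t = 0.5, 2.0, 3.0  # illustrative values, not from the book

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = exp_survival(s + t, rate) / exp_survival(s, rate)
# P(X > t): survival of a brand-new unit over the same additional time
unconditional = exp_survival(t, rate)

print(conditional, unconditional)
```

The two printed numbers agree up to floating-point rounding, which is the memory-less property in action: having survived s time units does not change the chance of surviving t more.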

The mean and variance of the exponential distribution with rate $\lambda$ are as follows:

$$E(X) = \frac{1}{\lambda}, \qquad Var(X) = \frac{1}{\lambda^2}$$

The complete Python code block is given next:
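The book's Python listing is not reproduced in this copy, and the sketch below should not be read as the author's program. It is a minimal standard-library stand-in, assuming that tabulating the densities is an acceptable substitute for plotting them; `exp_pdf` and the evaluation grid are hypothetical choices, while the rates are those of the first R panel.

```python
import math


def exp_pdf(x, rate):
    """Density of an exponential RV: rate * exp(-rate * x) for x >= 0, else 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


rates = [1, 0.2, 0.5, 0.7, 0.85]   # rates used in the first R panel
grid = [0, 1, 2, 5, 10]            # hypothetical evaluation points on [0, 10]

for rate in rates:
    densities = [round(exp_pdf(x, rate), 4) for x in grid]
    # Mean is 1/rate and variance is 1/rate^2 for the exponential distribution.
    print(f"rate={rate}: f(x)={densities} mean={1 / rate:.3f} variance={1 / rate ** 2:.3f}")
```

A plotting library could draw these curves in the same way `curve(dexp(...))` does in R; that step is left out here so that the example runs with nothing beyond the standard library.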

Normal distribution

The normal distribution is in some sense an all-pervasive distribution that arises sooner or later in almost any statistical discussion. In fact, the reader is very likely already familiar with certain aspects of the normal distribution; for example, the normal density curve is bell-shaped. Its mathematical appeal is perhaps reflected in the fact that, although its expression is fairly simple, the density function includes three of the most famous irrational numbers: $\pi$, $e$, and $\sqrt{2}$.

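Suppose that X is normally distributed with the mean $\mu$ and the variance $\sigma^2$. Then, the probability density function of the normal RV is given by:

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty$$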

If the mean is zero and the variance is 1, the normal RV is referred to as the standard normal RV, and it is conventionally denoted by Z.
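As a quick check of the density formula, the following Python sketch codes the normal pdf by hand and compares it with `statistics.NormalDist` from the standard library. The function name `normal_pdf` and the test point (mean 10, standard deviation 2, x = 11) are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal RV with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Height of the standard normal density at its peak, 1 / sqrt(2 * pi).
peak = normal_pdf(0.0)

# Hand-coded formula versus the standard library, at an arbitrary point.
by_hand = normal_pdf(11.0, mu=10.0, sigma=2.0)
by_library = NormalDist(mu=10.0, sigma=2.0).pdf(11.0)

print(peak, by_hand, by_library)
```

Note that `NormalDist` is parameterized by the standard deviation `sigma`, not by the variance, so a variance of 4 corresponds to `sigma=2.0`.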

Example 1.4.2. Shady Normal Curves: We will again consider a standard normal random variable, which is more popularly denoted in statistics by Z. Some of the most frequently needed probabilities are P(Z > 0), P(-1.96 < Z < 1.96), and P(-2.58 < Z < 2.58). These probabilities are now shaded:

> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")

Shady normal curves

The Python program for the shady normal probabilities is given next:
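The book's Python listing is not reproduced in this copy, and what follows is not the author's program. As a minimal stand-in that assumes the numeric values of the shaded areas are of interest, this sketch uses `statistics.NormalDist` from the standard library to compute the three probabilities that the R panels shade; the variable names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_greater_zero = 1.0 - Z.cdf(0.0)            # P(Z > 0)
p_within_196 = Z.cdf(1.96) - Z.cdf(-1.96)    # P(-1.96 < Z < 1.96)
p_within_258 = Z.cdf(2.58) - Z.cdf(-2.58)    # P(-2.58 < Z < 2.58)

print(round(p_greater_zero, 4), round(p_within_196, 4), round(p_within_258, 4))
```

The three values come out at one half, roughly 95%, and roughly 99%, matching the coverage labels on the panels.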