Statistical Application Development with R and Python - Second Edition
Overview of this book

Statistical analysis involves collecting and examining data in order to describe its nature, explore relationships within it, and build models that support better decisions. This book presents statistical concepts alongside R and Python, which are integrated from the outset: almost every concept is accompanied by R code that illustrates its application, and the R programs are further strengthened with equivalent Python programs. You will first study data characteristics, descriptive statistics, and exploratory analysis, which give you a firm footing in data analysis; statistical inference then completes the technical foundation of statistical methods. Regression (linear and logistic modeling) and CART build the essential toolkit, which will help you tackle complex real-world problems. You will begin with a brief look at the nature of data and end with modern and advanced statistical models such as CART. The journey starts with exploratory analysis, which is more than simple descriptive data summaries, then moves on to linear regression modeling, and ends with logistic regression, CART, and spatial statistics. Every step is taken with data and R code, further enhanced by Python. By the end of this book, you will be able to apply your statistical learning in major domains at work or in your projects.

Experiments with uncertainty in computer science


The common man of the previous century was skeptical about chance and randomness, attributing it to the lack of accurate instruments and to information not being captured across enough variables. For the common man of the current era, skepticism about the need to model randomness continues, but for the opposite reasons: instruments now seem accurate enough, and multi-variable information appears to eliminate uncertainty. However, this is not the case, and we will look at some examples that drive this point home.

In the previous section, we dealt with data arising from a questionnaire about the service level at a car dealership. It is natural to accept that different individuals respond in distinct ways, and further, that a car, being a complex assembly of many components, behaves differently under near-identical conditions. A question then arises: do we really have to deal with such situations, involving uncertainty, in computer science? The answer is certainly affirmative, and we will consider some examples from computer science and engineering.

Suppose that the task is the installation of software, say R itself. A new lab has been set up with 10 new desktops of the same configuration; that is, the RAM, storage, processor, operating system, and so on are identical across the 10 machines.

For simplicity, assume that the electricity supply and lab temperature are identical for all the machines. Do you expect the complete R installation, carried out as per the directions in the next section, to take the same number of milliseconds on all 10 machines? The runtime of an operation can easily be recorded, with other software if not manually. The answer is a clear no, as there will be minor variations in the processes active on the different desktops. Thus, we have our first experiment in the domain of computer science that involves uncertainty.
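This variability is easy to observe. The following sketch, in Python (one of the book's two languages), times a dummy_task function, a hypothetical stand-in for an installation step, ten times on the same machine; the recorded runtimes will differ slightly from run to run:

```python
import time
import statistics

def dummy_task(n=200_000):
    # Hypothetical stand-in for one installation step: a fixed amount of work.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Time the identical task 10 times, mimicking 10 identical installations.
runtimes = []
for _ in range(10):
    start = time.perf_counter()
    dummy_task()
    runtimes.append(time.perf_counter() - start)

print("mean runtime (s):", statistics.mean(runtimes))
print("std. deviation  :", statistics.stdev(runtimes))
```

Even though the task is identical on every run, the standard deviation is virtually never zero; background processes alone are enough to introduce variation.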

Suppose that the lab is now 2 years old. As an administrator, do you expect all 10 machines to be in the same working condition, given that we started with identical configurations and environments? The question is relevant because, by general experience, a few machines may have broken down. Despite the warranty and the assurances given by the desktop company, the number of machines that actually break down will not exactly match the number assured. Thus, we again have uncertainty.

Assume that three machines are not functioning at the end of 2 years. As an administrator, you call the service vendor to fix the problem. For the sake of simplicity, we assume that the three machines failed in the same way, say a motherboard failure on each. Is it practical to expect the vendor to fix the three machines in identical time?

Again, by experience, we know that this is very unlikely. If you think otherwise, suppose that 100 identical machines ran for 2 years and that 30 of them now have the motherboard issue. It is then clear that some machines may require a component replacement, while others will start functioning after a simple repair or fix, so the service times will differ.
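The breakdown experiment can be sketched as a small simulation; the failure probability of 0.3 below is an assumed, illustrative value, not a figure from the text:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def count_failures(n_machines=100, p_fail=0.3):
    # Each machine fails independently with probability p_fail (assumed value).
    return sum(random.random() < p_fail for _ in range(n_machines))

# Five hypothetical 2-year periods: the counts hover around 30 but vary.
counts = [count_failures() for _ in range(5)]
print(counts)
```

Repeating the experiment rarely produces the same count twice, which is exactly the uncertainty described above.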

Let us now summarize the preceding experiments with the following questions:

  • What is the average installation time for the R software on identically configured machines?

  • How many machines are likely to break down after a period of 1 year, 2 years, and 3 years?

  • If a failed machine has issues related to the motherboard, what is the average service time?

  • What is the fraction of failed machines that have a failed motherboard component?
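The last two questions can be explored with a toy Monte Carlo sketch. All the numbers below (30 failed machines, a 0.6 chance of a motherboard issue, and the service-time distributions) are assumptions made purely for illustration:

```python
import random
import statistics

random.seed(42)

failed = 30  # assumed number of failed machines
# Each failed machine has a motherboard issue with assumed probability 0.6.
has_mb_issue = [random.random() < 0.6 for _ in range(failed)]

# Fraction of failed machines with a motherboard problem.
fraction_mb = sum(has_mb_issue) / failed

# Assumed service times (hours): motherboard replacements take longer.
service_times = [random.gauss(4.0, 0.5) if mb else random.gauss(1.5, 0.3)
                 for mb in has_mb_issue]

print("fraction with motherboard issue:", fraction_mb)
print("average service time (hours)   :", round(statistics.mean(service_times), 2))
```

Statistics formalizes exactly this step: replacing the assumed probabilities and distributions with ones estimated from observed data.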

The answers to these types of questions form the main objective of the subject of Statistics. Certain characteristics of uncertainty are captured by families of probability distributions; depending on the underlying problem, we work with discrete or continuous random variables (RVs). The important and widely useful probability distributions form the content of the rest of the chapter. We will begin with the useful discrete distributions.
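As a preview, the second question above (how many machines are likely to break down?) is naturally modeled by a discrete RV, the binomial. A minimal sketch, again assuming an illustrative per-machine failure probability of 0.3:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(exactly k failures among n independent machines),
    # each failing with probability p.
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # 10 machines, assumed failure probability per machine

print("P(exactly 3 failures):", round(binom_pmf(3, n, p), 4))  # ≈ 0.2668
print("expected failures    :", n * p)
```

The chapters ahead develop this and the other standard distributions with R code and equivalent Python programs.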