Simulation for Data Science with R

By: Matthias Templ

Overview of this book

Simulation for Data Science with R aims to teach you how to begin performing data science tasks by taking advantage of R's powerful ecosystem of packages. R is among the most widely used programming languages for data science, and the combination is a powerful way to handle the complexities of varied real-world data sets. The book provides a computational and methodological framework for statistical simulation. Through this book, you will get to grips with the software environment R. After getting to know the background of popular methods in the area of computational statistics, you will see applications in R that help you understand the methods and gain experience of working with real-world data and real-world problems. This book helps uncover the large-scale patterns in complex systems where interdependencies and variation are critical. An effective simulation is driven by data-generating processes that accurately reflect real physical populations. You will learn how to plan and structure a simulation project to aid the decision-making process as well as the presentation of results.
Table of Contents (13 chapters)

Chapter 1. Introduction

In the previous century, the Vienna University of Technology offered a bachelor's program called data engineering and statistics. Its content corresponded closely to what is nowadays commonly called data science. Data-oriented lectures in computer science, such as storing and retrieving data, programming, and data security, were in the curriculum, together with applied lectures on statistics, such as multivariate statistics, biostatistics, financial statistics, statistical learning, and official statistics. We had too few students, and after a few years the program was canceled. Sixteen years later, the picture has completely changed. New bachelor's and master's programs in data science have been developed all over the world in the last few years. Universities have found that they must offer studies in data science, both because industry needs experts in it and because developments in statistics in recent years have come almost exclusively from an area called computational statistics.

Statistics is the original form of computing with data, and computational statistics takes this to an extreme, where methods and tools are developed in a highly data-dependent manner, using and developing modern computational tools. Computational statistics and data science are closely related. Computational statistics covers a broad swathe of data science, excluding data management and data security issues. Computational statistics (and therefore also data science) has become very popular since the eighties, and it is very likely the most influential area of statistics nowadays. In the field of computational statistics, new methodology is not only developed but also implemented in software, nowadays almost exclusively in the long-established yet modern software environment R.

Data science seems like a good term when your work is driven by data, with less emphasis on method and algorithm development than in computational statistics, but with many pure computer science topics related to storing, retrieving, and handling data sets. It also differs from computational statistics in other respects. For example, in the area of data visualization, purely process-related visualizations (airflows in an engine, for example) are a topic in data science but not in computational statistics.

Wikipedia defines data science as a field that:

"incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products."

Data science is the management of the entire modeling process, from data collection, storage, and management, through data pre-processing (editing, imputation), data analysis, and modeling, to automated reporting and presentation of the results, all in a reproducible manner. It is thus also an interdisciplinary study that extracts meaning from data with statistics, drawing on many elements of computer science as well as general subject-matter skills. In that sense, data science is an extension and continuation of statistics. Data scientists use statistics and data-oriented computer science tools to solve the problems they face.

Statistical simulation is an essential area of data science. The core topics of this book are simulating distributions and data sets, Monte Carlo methods for inferential statistics, and solutions based on computer-intensive approaches. The book discusses various areas of statistical simulation: random number simulation, resampling, Monte Carlo methods, statistical theory explained by simulation experiments, agent-based microsimulation, and system dynamics. The aim is to put a book into the hands of readers that explains methods, gives advice on the use of those methods, and provides computational tools to solve common problems in statistical simulation and computer-intensive methods.

In this book, the theory is not just explained; it is also made understandable with illustrative examples using the R software environment. The reader will get to grips with the R software environment. After getting the background on popular methods in the field, readers will see applications in R that help them understand the methods and gain experience of working with real-world data and real-world problems.

R itself is perfectly suited to carrying out simulations. The basics of R are not a topic of this book, but advanced data manipulation and advanced visualization tools in R are shown. The reader should therefore not be a complete newcomer to R; those who are should first read a very basic introduction to R.

Readers will get a brief overview of the problems and possibilities of data-driven simulation and resampling methods.