Simulation for Data Science with R

By: Matthias Templ

Overview of this book

This book aims to teach you how to begin performing data science tasks by taking advantage of R's powerful ecosystem of packages. R is one of the most widely used programming languages in data science, and it is well suited to handling the complexities of varied real-world data sets. The book provides a computational and methodological framework for statistical simulation. Through this book, you will get to grips with the software environment R. After getting to know the background of popular methods in the area of computational statistics, you will see applications in R that help you understand the methods better and gain experience of working with real-world data and real-world problems. This book helps uncover the large-scale patterns in complex systems where interdependencies and variation are critical. An effective simulation is driven by data-generating processes that accurately reflect real physical populations. You will learn how to plan and structure a simulation project to aid the decision-making process as well as the presentation of results.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Introduction

What is simulation and where is it applied?

Why use simulation?

Simulation and big data

Choosing the right simulation technique

Summary

References

R and High-Performance Computing

The R statistical environment

Generic functions, methods, and classes

Data manipulation in R

High performance computing

Visualizing information

References

The Discrepancy between Pencil-Driven Theory and Data-Driven Computational Solutions

Machine numbers and rounding problems

Condition of problems

Summary

References

Simulation of Random Numbers

Real random numbers

Simulating pseudo random numbers

Simulation of non-uniform distributed random variables

Tests for random numbers

Summary

References

Monte Carlo Methods for Optimization Problems

Numerical optimization

Dealing with stochastic optimization

Summary

References

Probability Theory Shown by Simulation

Some basics on probability theory

Probability distributions

Winning the lottery

The weak law on large numbers

The central limit theorem

Properties of estimators

Summary

References

Resampling Methods

The bootstrap

Estimation of standard errors with bootstrapping

The parametric bootstrap

Estimating bias with bootstrap

The jackknife

Cross-validation

Summary

References

Applications of Resampling Methods and Monte Carlo Tests

The bootstrap in regression analysis

Proper variance estimation with missing values

Bootstrapping in time series

Bootstrapping in the case of complex sampling designs

Monte Carlo tests

Summary

The EM Algorithm

The basic EM algorithm

The EM algorithm by example of k-means clustering

The EM algorithm for the imputation of missing values

Summary

References

Simulation with Complex Data

Different kinds of simulation and software

Simulating data using complex models

Model-based simulation studies

Design-based simulation

Inserting missing values

Summary

System Dynamics and Agent-Based Models

Agent-based models

Dynamics in love and hate

Dynamic systems in ecological modeling

Summary

References

Index

Chapter 1. Introduction

In the previous century, the Vienna University of Technology offered a bachelor's program called data engineering and statistics. Its content corresponded closely to what is now commonly called data science. Data-oriented lectures in computer science, such as storing and retrieving data, programming, and data security, were in the curriculum, together with applied lectures on statistics, such as multivariate statistics, biostatistics, financial statistics, statistical learning, and official statistics. We had too few students, and after a few years the program was canceled. Sixteen years later, the picture has completely changed. New bachelor's and master's programs in data science have been developed all over the world in the last few years. Universities have found that they must offer studies in data science, not only because industry needs experts in it, but also because developments in statistics in recent years have come almost exclusively from an area called computational statistics.

Statistics is the original form of computing with data, and computational statistics takes this to an extreme where methods and tools are developed in a highly data-dependent manner, using and developing modern computational tools. Computational statistics and data science are closely related. Computational statistics covers a broad swathe of data science, excluding data management and data security issues. Computational statistics (and therefore also data science) has become very popular since the 1980s, and it is very likely the most influential area of statistics today. In the field of computational statistics, new methodology is not only developed but also implemented in software, nowadays almost exclusively in the long-established yet modern software environment R.

Data science seems like a good term when your work is driven by data, with a weaker emphasis on method and algorithm development than in computational statistics but with many pure computer science topics related to storing, retrieving, and handling data sets. It also differs from computational statistics in other respects. For example, in the area of data visualization, purely process-related visualizations (airflows in an engine, for example) are a topic in data science but not in computational statistics.

Wikipedia defines data science as a field that:

"incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products."

Data science is the management of the entire modeling process, from data collection, storage, and management, through data pre-processing (editing, imputation), data analysis, and modeling, to automated reporting and presentation of the results, all in a reproducible manner. It is thus also an interdisciplinary field that extracts meaning from data with statistics, drawing on many elements of computer science as well as general subject-matter skills. In that sense, data science is an extension and continuation of statistics. Data scientists use statistics and data-oriented computer science tools to solve the problems they face.

Statistical simulation is an essential area in data science. The core topics of this book are simulating distributions and data sets, Monte Carlo methods for inferential statistics, and solutions based on computer-intensive approaches. This book discusses various areas of statistical simulation: random number simulation, resampling, Monte Carlo methods, statistical theory explained by simulation experiments, agent-based microsimulation, and system dynamics. The aim is to put a book into the hands of readers that explains methods, gives advice on the use of those methods, and provides computational tools to solve common problems in statistical simulation and computer-intensive methods.

In this book, the theory is not just explained; it is also made understandable with illustrative examples using the R software environment. The reader will get to grips with the R software environment. After getting the background on popular methods in the field, readers will see applications in R to better understand the methods, as well as to gain experience when working on real-world data and real-world problems.

R itself is perfectly suited to carrying out simulations. The basics of R are not a topic of this book, but advanced data manipulation and advanced visualization tools are shown in R. The reader should therefore not be a complete newcomer to R; readers who are should first read a basic introduction to R.

Readers will get a brief overview of the problems and possibilities of data-driven simulation and resampling methods.
