Simulation for Data Science with R

By: Matthias Templ

Overview of this book

Simulation for Data Science with R aims to teach you how to begin performing data science tasks by taking advantage of R's powerful ecosystem of packages. R is among the most widely used programming languages for data science, and it is well suited to handling the complexities of varied real-world data sets. The book provides a computational and methodological framework for statistical simulation. Through this book, you will get to grips with the R software environment. After getting to know the background of popular methods in computational statistics, you will see applications in R that help you understand the methods and gain experience of working with real-world data and real-world problems. This book helps uncover the large-scale patterns in complex systems where interdependencies and variation are critical. An effective simulation is driven by data-generating processes that accurately reflect real physical populations. You will learn how to plan and structure a simulation project to aid in the decision-making process as well as the presentation of results.
Table of Contents (18 chapters)
Simulation for Data Science with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface
Index

Index

A

  • aes() assignment / The ggplot2 package
  • aesthetic mapping / The ggplot2 package
  • agent-based modeling / What is simulation and where is it applied?
  • agent-based modeling (ABM) / Choosing the right simulation technique
  • agent-based models
    • about / Agent-based models
  • alias method / The alias method
  • arithmetic random number generators / Simulating pseudo random numbers

B

  • Beta distribution / Simulating random numbers from a Beta distribution
  • BFGS method / Further general-purpose optimization methods
  • bias
    • estimating, bootstrap used / Estimating bias with bootstrap
  • Bias Corrected alpha (BCa) confidence interval method / Confidence intervals by bootstrap
  • Big Boss 2 approach / Why the bootstrap works
  • Big Boss approach / Why the bootstrap works
  • bootstrap / Why use simulation?
    • about / The bootstrap, A closer look at the bootstrap
    • motivating example, with odds ratios / A motivating example with odds ratios
    • working / Why the bootstrap works
    • to estimate standard error / Estimation of standard errors with bootstrapping
    • complex estimation, example / An example of a complex estimation using the bootstrap
    • bias, estimating / Estimating bias with bootstrap
    • confidence intervals / Confidence intervals by bootstrap
    • in regression analysis / The bootstrap in regression analysis
    • using / Motivation to use the bootstrap
    • method / The most popular but often worst method
    • by draws from residuals / Bootstrapping by draws from residuals
    • in time series / Bootstrapping in time series
    • in case of complex sampling designs / Bootstrapping in the case of complex sampling designs

C

  • central limit theorem
    • about / The central limit theorem
  • CG method / Further general-purpose optimization methods
  • classes
    • about / Generic functions, methods, and classes
  • classical linear regression model / The classical linear regression model
  • complex models
    • used, for simulating data / Simulating data using complex models
  • Comprehensive R Archive Network (CRAN)
    • about / The R statistical environment
    • reference link / The R statistical environment
  • confidence intervals / Confidence intervals
    • by bootstrap / Confidence intervals by bootstrap
  • congruential generators / Congruential generators
    • linear / Linear and multiplicative congruential generators
    • multiplicative / Linear and multiplicative congruential generators
  • contamination
    • adding / Adding contamination
  • cross-validation
    • about / Cross-validation
    • classical linear regression model / The classical linear regression model
    • basic concept / The basic concept of cross validation
    • classical cross validation / Classical cross validation – 70/30 method
    • leave-one-out cross validation / Leave-one-out cross validation
    • k-fold cross validation / k-fold cross validation

D

  • data
    • simulating, complex methods used / Simulating data using complex models
    • model-based simple example / A model-based simple example
  • data.table package
    • used, for data manipulation / Data manipulation with the data.table package
    • variable construction / data.table – variable construction
    • indexing / data.table – indexing or subsetting
    • subsetting / data.table – indexing or subsetting
    • keys / data.table – keys
    • fast subsetting / data.table – fast subsetting
    • calculations, in groups / data.table – calculations in groups
  • data manipulation
    • in R / Data manipulation in R
    • apply, using / Apply and friends with basic R
    • dplyr package, using / Basic data manipulation with the dplyr package
    • data.table package, using / Data manipulation with the data.table package
  • Data Scientist approach / Why the bootstrap works
  • data types, R
    • about / Data types
    • vectors / Vectors in R
    • factors / Factors in R
    • list / list
    • data.frame / data.frame
    • array / array
  • design-based simulation
    • about / Design-based simulation
    • complex survey data, example / An example with complex survey data
    • synthetic population, simulation / Simulation of the synthetic population
    • interest, estimators / Estimators of interest
    • sampling design, defining / Defining the sampling design
    • stratified sampling, using / Using stratified sampling
    • contamination, adding / Adding contamination
    • performing, separately on different domains / Performing simulations separately on different domains
  • design-based simulation (DBS) / Choosing the right simulation technique
  • design-based simulation studies / Different kinds of simulation and software
  • dplyr package
    • used, for data manipulation / Basic data manipulation with the dplyr package
    • local data frame / dplyr – creating a local data frame
    • selection of lines / dplyr – selecting lines
    • order / dplyr – order
    • selection of columns / dplyr – selecting columns
    • uniqueness / dplyr – uniqueness
    • variables, creating / dplyr – creating variables
    • grouping / dplyr – grouping and aggregates
    • aggregates / dplyr – grouping and aggregates
    • window functions / dplyr – window functions
  • dynamics
    • about / Dynamics in love and hate
  • dynamic systems
    • in ecological modeling / Dynamic systems in ecological modeling

E

  • EM algorithm
    • about / The basic EM algorithm
    • prerequisites / Some prerequisites
    • formal definition / Formal definition of the EM algorithm
    • introductory example / Introductory example for the EM algorithm
    • explaining, by k-means clustering example / The EM algorithm by example of k-means clustering
    • used, for imputation of missing values / The EM algorithm for the imputation of missing values
  • estimators
    • properties / Properties of estimators, Properties of estimators
    • confidence intervals / Confidence intervals
    • robust estimators / A note on robust estimators

F

  • finite populations
    • simulating, with cluster or hierarchical structures / Simulating finite populations with cluster or hierarchical structures
  • Fortran / High performance computing, Profiling to detect computationally slow functions in code

G

  • generators / More generators
  • generic functions
    • about / Generic functions, methods, and classes, Warm-up example – a high-level plot
  • Gibbs sampler
    • about / The Gibbs sampler
    • two-phase Gibbs sampler / The two-phase Gibbs sampler
    • multiphase Gibbs sampler / The multiphase Gibbs sampler
    • linear regression, application / Application in linear regression
  • gradient ascent/descent method / Gradient ascent/descent
  • graphics package
    • about / The graphics package
    • high-level graphics functions / The graphics package
    • low-level graphics functions / The graphics package
    • interactive functions / The graphics package
    • (high-level) plot example / Warm-up example – a high-level plot
    • graphics parameters, controlling / Control of graphics parameters

H

  • high-dimensional data
    • simulating, example / An example of simulating high-dimensional data
  • high-level plot functions / Control of graphics parameters
  • high performance computing
    • about / High performance computing
    • slow functions, detecting with profiling / Profiling to detect computationally slow functions in code
    • benchmarking / Further benchmarking
    • parallel computing / Parallel computing
    • interfaces to C++ / Interfaces to C++

I

  • information visualization
    • about / Visualizing information
    • graphics system, in R / The graphics system in R
    • graphics package / The graphics package
    • package ggplot2 / The ggplot2 package
  • interactive graphics / Visualizing information
  • inversion method / The inversion method

J

  • jackknife
    • about / The jackknife
    • sample / The jackknife
    • disadvantages / Disadvantages of the jackknife
    • delete-d jackknife / The delete-d jackknife
    • after bootstrap / Jackknife after bootstrap

K

  • k-fold cross validation / k-fold cross validation
  • k-means clustering
    • used, for EM algorithm demonstration / The EM algorithm by example of k-means clustering
  • k-Nearest Neighbor (k-NN) / A model-based simulation study

L

  • L-BFGS-B method / Further general-purpose optimization methods
  • leave-one-out cross validation / Leave-one-out cross validation
  • lottery
    • winning / Winning the lottery
  • low-level functions / Control of graphics parameters

M

  • machine numbers
    • and rounding, issues / Machine numbers and rounding problems
    • 64-bit representation, example / Example – the 64-bit representation of numbers
    • convergence / Convergence in the deterministic case
    • convergence, example / Example – convergence
  • Markov chain Monte Carlo (MCMC) / Choosing the right simulation technique
  • Markov chain Monte Carlo (MCMC) methods / What is simulation and where is it applied?
  • Marsaglia
    • URL / Tests for random numbers
  • Mathematician approach / Why the bootstrap works
  • method dispatch / Warm-up example – a high-level plot
  • methods
    • about / Generic functions, methods, and classes
  • Metropolis-Hastings
    • about / Metropolis-Hastings revisited
  • Metropolis-Hastings algorithm
    • about / Metropolis - Hastings algorithm
    • Markov chains / A few words on Markov chains
  • Metropolis sampler / The Metropolis sampler
  • micro-simulation / What is simulation and where is it applied?
  • Minimum Covariance Determinant (MCD) algorithm / An example of a complex estimation using the bootstrap
  • missing completely at random (MCAR) / Inserting missing values
  • missing not at random (MNAR) / Inserting missing values
  • missing values
    • imputing, with EM algorithm / The EM algorithm for the imputation of missing values
    • inserting / Inserting missing values
  • mixtures
    • model-based example / A model-based example with mixtures
  • model-based approach
    • to simulate data / Model-based approach to simulate data
  • model-based example
    • with mixtures / A model-based example with mixtures
  • model-based simple example / A model-based simple example
  • model-based simulation (MBS) / Choosing the right simulation technique
  • model-based simulation studies
    • about / Model-based simulation studies, A model-based simulation study
    • latent model example / Latent model example continued
    • example / A simple example of model-based simulation
  • Modgen
    • URL / Agent-based models
  • Monte Carlo simulations
    • about / What is simulation and where is it applied?
    • Bayesian statistics / What is simulation and where is it applied?
    • Markov chain Monte Carlo (MCMC) methods / What is simulation and where is it applied?
    • statistical uncertainty / What is simulation and where is it applied?
    • multi-dimensional integrals / What is simulation and where is it applied?
    • numerical optimization / What is simulation and where is it applied?
    / Choosing the right simulation technique
  • Monte Carlo tests
    • about / Monte Carlo tests
    • motivating example / A motivating example
    • permutation test, as special kind of MC test / The permutation test as a special kind of MC test
    • for multiple groups / A Monte Carlo test for multiple groups
    • hypothesis testing, bootstrap used / Hypothesis testing using a bootstrap
    • multivariate normality, test for / A test for multivariate normality
    • test, size / Size of the test
    • power comparisons / Power comparisons

N

  • Nelder-Mead method / Further general-purpose optimization methods
  • Newton-Raphson method / Newton-Raphson methods
  • non-uniform distributed random variables, simulation
    • about / Simulation of non-uniform distributed random variables
    • inversion method / The inversion method
    • alias method / The alias method
    • counts in tables, estimation with log-linear models / Estimation of counts in tables with log-linear models
    • rejection sampling / Rejection sampling
    • values, simulating from normal distribution / Simulating values from a normal distribution
    • random numbers, simulating from Beta distribution / Simulating random numbers from a Beta distribution
    • truncated distributions / Truncated distributions
    • Metropolis-Hastings algorithm / Metropolis - Hastings algorithm
    • Markov chains / A few words on Markov chains
    • Metropolis sampler / The Metropolis sampler
    • Gibbs sampler / The Gibbs sampler
    • MCMC samples, diagnosis / The diagnosis of MCMC samples
  • numerical optimization
    • about / Numerical optimization
    • gradient ascent/descent method / Gradient ascent/descent
    • Newton-Raphson method / Newton-Raphson methods
    • general-purpose optimization methods / Further general-purpose optimization methods
    • Nelder-Mead method / Further general-purpose optimization methods
    • BFGS method / Further general-purpose optimization methods
    • CG method / Further general-purpose optimization methods
    • L-BFGS-B method / Further general-purpose optimization methods
    • SANN method / Further general-purpose optimization methods

O

  • OpenM++
    • URL / Agent-based models
  • optimization (O) / Choosing the right simulation technique

P

  • parametric bootstrap
    • about / The parametric bootstrap
  • percentile confidence intervals / Confidence intervals by bootstrap
  • plug-in principle
    • about / The plug-in principle
  • probability distributions
    • about / Probability distributions
    • discrete probability distributions / Discrete probability distributions
    • continuous probability distributions / Continuous probability distributions
  • probability theory
    • basics / Some basics on probability theory
  • problems
    • conditions / Condition of problems
  • pseudo random number generators
    • about / Simulating pseudo random numbers
    • arithmetic random number generators / Simulating pseudo random numbers
    • recursive arithmetic random number generators / Simulating pseudo random numbers

R

  • R
    • statistical environment / The R statistical environment
    • about / The R statistical environment
    • basics / Basics in R
    • overview / Some very basic stuff about R
    • installation / Installation and updates
    • installation link / Installation and updates
    • updates / Installation and updates
    • update link / Installation and updates
    • help option / Help
    • workspace / The R workspace and the working directory
    • working directory / The R workspace and the working directory
    • data types / Data types
    • missing values / Missing values
    • data manipulation / Data manipulation in R
  • random
    • URL / Real random numbers
  • random numbers
    • about / Real random numbers
    • pseudo random numbers, simulating / Simulating pseudo random numbers
    • congruential generators / Congruential generators
    • congruential generators, linear / Linear and multiplicative congruential generators
    • congruential generators, multiplicative / Linear and multiplicative congruential generators
    • lagged Fibonacci generators / Lagged Fibonacci generators
    • generators / More generators
    • testing / Tests for random numbers
    • example / The evaluation of random numbers – an example of a test
  • recursive arithmetic random number generators / Simulating pseudo random numbers
  • reference links / References, References
  • resampling method / Why use simulation?
  • robust estimators / A note on robust estimators
  • R Project
    • reference link / The R statistical environment
  • RStudio
    • reference link / The R statistical environment

S

  • sampling design
    • defining / Defining the sampling design
  • SANN method / Further general-purpose optimization methods
  • simario
    • URL / Agent-based models
  • simulation
    • about / What is simulation and where is it applied?
    • applying, in sampling / What is simulation and where is it applied?
    • micro-simulation / What is simulation and where is it applied?
    • agent-based modeling / What is simulation and where is it applied?
    • Monte Carlo simulations / What is simulation and where is it applied?
    • uses / Why use simulation?
    • and big data / Simulation and big data
    • technique, selecting / Choosing the right simulation technique
    • burning fire simulation, URL / Choosing the right simulation technique
  • simulations
    • types / Different kinds of simulation and software
    • performing, separately on different domains / Performing simulations separately on different domains
  • statistical simulation
    • about / What is simulation and where is it applied?
  • stochastic optimization
    • about / Dealing with stochastic optimization
    • Star Trek / Simplified procedures (Star Trek, Spaceballs, and Spaceballs princess)
    • Spaceballs / Simplified procedures (Star Trek, Spaceballs, and Spaceballs princess)
    • Spaceballs princess / Simplified procedures (Star Trek, Spaceballs, and Spaceballs princess)
    • Metropolis-Hastings / Metropolis-Hastings revisited
    • gradient-based / Gradient-based stochastic optimization
  • stratified sampling
    • defining / Using stratified sampling
  • synthetic population
    • simulating / Simulation of the synthetic population
  • system dynamics / What is simulation and where is it applied?
  • system dynamics (SD) / Choosing the right simulation technique

V

  • variables / The ggplot2 package
  • vector selection
    • positive way / Vectors in R
    • negative way / Vectors in R
    • logical way / Vectors in R

W

  • weak law of large numbers
    • about / The weak law on large numbers
    • Emperor penguins, and boss / Emperor penguins and your boss
    • random variables, limits / Limits and convergence of random variables
    • random variables, convergence / Limits and convergence of random variables
    • sample mean, convergence / Convergence of the sample mean – weak law of large numbers
    • displaying, by simulation / Showing the weak law of large numbers by simulation
  • window functions
    • about / dplyr – window functions
    • offsets / dplyr – window functions
    • ranking/ordering / dplyr – window functions
    • cumulative functions / dplyr – window functions