The strategy in this chapter is to first retrieve a small, publicly available dataset (Pima Indians diabetes). We will then perform some basic exploratory analysis, compute some key statistical properties, and use those properties to simulate a much larger dataset to feed into Spark. The key characteristics that we will use to generate this 'big data' are:
- The means/standard deviations of the variables: the goal is for the means and standard deviations of the large dataset to be close to those of the small dataset.
- The correlations of the variables: since statistical modeling and analysis are largely based on the associations among the variables, the goal of the simulation is to preserve, in the large dataset, all of the pairwise correlations that exist in the small dataset.
- The underlying distribution of the variables: we will assume normal distributions...
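
To make the three characteristics above concrete, here is a minimal sketch of one common way to simulate such data: estimate the means, standard deviations, and pairwise correlations from the small sample, then draw correlated normal values using a Cholesky factor of the correlation matrix. This is an illustration under stated assumptions, not necessarily the method used later in the chapter. The rows in `small` are invented toy values (not actual Pima Indians records), and the helper names `pearson`, `cholesky`, `summarize`, and `simulate` are hypothetical.

```python
import math
import random

# Invented toy sample: 8 rows x 3 numeric variables (NOT real Pima data).
small = [
    [100.0, 25.0, 30.0], [120.0, 31.0, 45.0], [90.0, 22.0, 25.0],
    [150.0, 35.0, 50.0], [110.0, 28.0, 41.0], [135.0, 30.0, 33.0],
    [95.0, 27.0, 52.0], [160.0, 38.0, 47.0],
]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def summarize(rows):
    """Return (means, sample standard deviations, correlation matrix)."""
    cols = [list(c) for c in zip(*rows)]
    n = len(rows)
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    corr = [[pearson(a, b) for b in cols] for a in cols]
    return means, sds, corr

def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def simulate(means, sds, corr, n, seed=0):
    """Draw n rows from a multivariate normal with the given moments."""
    rng = random.Random(seed)
    low = cholesky(corr)
    k = len(means)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # independent N(0, 1)
        # Correlate via L, then rescale to each variable's mean and sd.
        rows.append([means[i] + sds[i] * sum(low[i][j] * z[j]
                                             for j in range(i + 1))
                     for i in range(k)])
    return rows

means, sds, corr = summarize(small)
big = simulate(means, sds, corr, n=20000)
big_means, big_sds, big_corr = summarize(big)
```

Comparing `big_means`, `big_sds`, and `big_corr` against `means`, `sds`, and `corr` shows how closely the simulated data tracks the small sample. The match is approximate rather than exact, because the large dataset is a random draw.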