We just simulated the positive cases. Now, let's set up some similar code to simulate the non-diabetes patients (outcome=0). For the negative cases, we will also multiply sample.bin by -1, so that in the future, we know that all the positive sample.bin instances correspond to positive cases and all the negative sample.bin instances correspond to negative ones:
set.seed(123)
nbins2 = base::round(n2/400, 0)

# Correlation and covariance of the eight predictors for the negative cases
correlationMatrix <- cor(PimaIndians[PimaIndians$diabetes == 'neg', 1:8])
covarianceMatrix <- stats::cov(PimaIndians[PimaIndians$diabetes == 'neg', 1:8])

# Simulate n2 rows, negate the bin IDs, replicate each row 2,000 times,
# and convert the result to a Spark DataFrame
out_sd2 <- as.DataFrame(
  data.frame(
    sample.bin = base::sample(1:nbins2, n2, replace = TRUE) * (-1),
    outcome = 0,
    mvrnorm(n2, mu = means.neg,
            Sigma = matrix(covarianceMatrix, ncol = 8),
            empirical = TRUE)
  )[rep(1:n2, times = 2000), ]
)

nrow(out_sd2)
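To see what the simulation step does without starting Spark, here is a minimal local sketch of the same idea. It is an illustration under stated assumptions, not the code used in this chapter: real_neg is a made-up stand-in for the negative-outcome rows, and the names n_sim, n_bins, sim_neg, and sim_big, along with the replication factor of 5, are hypothetical. It draws rows with mvrnorm from the MASS package, negates the bin IDs, replicates the rows, and then checks that empirical = TRUE reproduces the input means and covariance.

```r
library(MASS)  # provides mvrnorm()

set.seed(123)

# Hypothetical stand-in for the negative-outcome rows (3 numeric columns).
real_neg <- data.frame(
  glucose  = rnorm(50, mean = 110, sd = 25),
  pressure = rnorm(50, mean = 70,  sd = 12),
  mass     = rnorm(50, mean = 31,  sd = 6)
)

n_sim  <- 20   # rows to simulate (assumed; must exceed the number of columns)
n_bins <- 4    # number of sample bins (assumed)
mu     <- colMeans(real_neg)
sigma  <- stats::cov(real_neg)

# Same layout as the main example: negated bin ID, outcome flag, simulated predictors.
sim_neg <- data.frame(
  sample.bin = base::sample(1:n_bins, n_sim, replace = TRUE) * (-1),
  outcome    = 0,
  mvrnorm(n_sim, mu = mu, Sigma = sigma, empirical = TRUE)
)

# Replicate the simulated rows (the main example uses times = 2000).
sim_big <- sim_neg[rep(1:n_sim, times = 5), ]

# With empirical = TRUE, the simulated sample matches mu and sigma
# up to floating-point error, so both comparisons should hold.
print(all.equal(colMeans(sim_neg[, names(mu)]), mu))
print(all.equal(stats::cov(sim_neg[, names(mu)]), sigma))

# Every replicated row carries a negative bin ID and outcome 0.
print(nrow(sim_big))
print(all(sim_big$sample.bin < 0))
```

Because every bin ID is multiplied by -1, the sign of sample.bin alone tells us which class a row belongs to once the positive and negative sets are combined. Returning to the Spark code, the final nrow(out_sd2) call reports the size of the replicated data frame.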
The output indicates that 500,000 rows were generated. Notice that it also indicates that two Spark jobs were run to obtain the result. For the...