Next, create the training and test datasets. The objective is to sample 80% of the data for the training set and 20% for the test set.
To speed up sampling somewhat, we can sample sequentially, taking the tails of the sample_bin range for the test dataset and the middle for the training data. This is still a random sample: sample_bin was originally generated randomly, so the sequence or range of the numbers has no bearing on the randomness.
Since we want 80% of our data to be training data, first take all of the sample_bin numbers that lie between the low and high cutoff values. We can define the cutoff range as 20% of the difference between the highest and lowest values of sample_bin.
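The cutoff arithmetic can be sketched in plain Python on synthetic data. This is only an illustration under stated assumptions, not the code used in this walkthrough: the helper names cutoffs and split_by_bin, the tail_fraction parameter, and the uniformly distributed stand-in values for sample_bin are all hypothetical. The fraction is left as a parameter because, when sample_bin is uniformly distributed, trimming a fraction f of the range from each end leaves roughly 1 - 2f of the rows in the middle: a value of 0.2 leaves about 60%, while 0.1 leaves about 80%.

```python
import random


def cutoffs(values, tail_fraction):
    """Return (low_cutoff, high_cutoff) for a collection of sample_bin values.

    tail_fraction is the share of the min-to-max span trimmed from EACH end
    (hypothetical parameter name, introduced for this sketch).
    """
    lowest, highest = min(values), max(values)
    cutoff_range = tail_fraction * (highest - lowest)
    return lowest + cutoff_range, highest - cutoff_range


def split_by_bin(rows, tail_fraction):
    """Split (sample_bin, record) pairs: middle -> train, tails -> test."""
    low, high = cutoffs([sample_bin for sample_bin, _ in rows], tail_fraction)
    train = [row for row in rows if low <= row[0] <= high]
    test = [row for row in rows if row[0] < low or row[0] > high]
    return train, test


# Synthetic stand-in data: assumes sample_bin is uniform on [0, 1).
random.seed(0)  # fixed seed so the illustration is repeatable
rows = [(random.random(), record_id) for record_id in range(10_000)]

train, test = split_by_bin(rows, tail_fraction=0.2)
train_share = len(train) / len(rows)
print(f"train share: {train_share:.3f}, test share: {1 - train_share:.3f}")
```

Every row lands in exactly one of the two sets, so the shares always sum to one. Checking the printed shares against the intended split is a quick way to confirm that the chosen fraction gives the proportions you expect.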
Set the low cutoff as the lowest value plus the cutoff range defined previously, and the high cutoff as the highest value minus the cutoff range:
#compute the minimum and maximum values of sample bin...