In this section, we will look at sampling and filtering RDDs to pick out relevant data points. Sampling is a powerful technique: instead of processing the whole of a very large dataset, we perform our calculations on a subset of it.
Let's now check how sampling not only speeds up our calculations, but also gives us a good approximation of the statistic that we are trying to calculate. To measure the speedup, we first import the time function from the time module as follows:
from time import time
The next thing we want to do is look at the lines, or data points, in the KDD database that contain the word normal. We start by loading the data:
raw_data = sc.textFile("./kdd.data.gz")
We need to create a sample of raw_data. We will store the sample in the sample variable, and we're sampling from raw_data without replacement. We're sampling 10% of the data, and we...
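To make the idea concrete without a running Spark cluster, here is a minimal sketch in plain Python of what sampling 10% of a dataset without replacement looks like, and how a statistic computed on the sample compares with the same statistic on the full data. Everything in it is an assumption made for illustration: the synthetic population stands in for the KDD data, bernoulli_sample is a hypothetical helper (not a Spark function), and the seed of 42 is chosen only for reproducibility. In PySpark, the corresponding call would be along the lines of raw_data.sample(False, 0.1, seed), where the seed value is up to us.

```python
import random
from time import time

# Hypothetical seed, used only so the sketch is reproducible.
rng = random.Random(42)

# Synthetic stand-in for the full dataset (NOT the KDD data).
population = [rng.random() for _ in range(200_000)]


def bernoulli_sample(data, fraction, rng):
    """Sample without replacement: keep each element independently
    with probability `fraction`, so no element is picked twice and
    the sample size is only approximately fraction * len(data)."""
    return [x for x in data if rng.random() < fraction]


# Take roughly 10% of the data.
sample = bernoulli_sample(population, 0.1, rng)

# Time the statistic (here, the mean) on the full data...
t0 = time()
full_mean = sum(population) / len(population)
full_time = time() - t0

# ...and on the sample.
t0 = time()
sample_mean = sum(sample) / len(sample)
sample_time = time() - t0

print("full data: mean=%.4f over %d items in %.5fs"
      % (full_mean, len(population), full_time))
print("sample:    mean=%.4f over %d items in %.5fs"
      % (sample_mean, len(sample), sample_time))
```

The sample holds about a tenth of the items, so the second calculation has about a tenth of the work to do, while the two means should agree closely. The exact timings depend on the machine, so none are shown here.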