In this section, we are going to look at splitting datasets and creating new combinations with set operations. In particular, we are going to learn two functions: subtract and cartesian.
Let's go back to Chapter 3 of the Jupyter Notebook, where we looked at the lines in the dataset that contain the word normal. Now let's try to get all the lines that don't contain the word normal. One way is to use the filter function to keep only the lines that don't have normal in them. But we can use something different in PySpark: a function called subtract, which takes the entire dataset and subtracts the data that contains the word normal. Let's have a look at the following snippet:
normal_sample = sampled.filter(lambda line: "normal." in line)
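Conceptually, subtract is a set difference by value. As a minimal sketch, here is a plain-Python stand-in (no Spark cluster needed) that mirrors the same flow on a few hypothetical KDD-style records; the sample lines and variable names are illustrative, not from the book's dataset:

```python
# Hypothetical sample of comma-separated connection records,
# standing in for the RDD called `sampled` in the book.
sampled = [
    "0,tcp,http,SF,215,45076,normal.",
    "0,tcp,http,SF,162,4528,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,private,S0,0,0,neptune.",
]

# Step 1: filter the lines that DO contain "normal."
# (same predicate as the PySpark filter above).
normal_sample = [line for line in sampled if "normal." in line]

# Step 2: the plain-Python equivalent of sampled.subtract(normal_sample):
# keep every line that is not in the normal set.
non_normal_sample = [line for line in sampled if line not in set(normal_sample)]

print(non_normal_sample)
# ['0,icmp,ecr_i,SF,1032,0,smurf.', '0,tcp,private,S0,0,0,neptune.']
```

Unlike the list comprehension here, PySpark's subtract runs distributed across partitions, but the result is the same: the dataset minus the records that matched the filter.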
We can then obtain interactions or data points that don't contain...