In this section, we will learn more about DataFrames and how to use Spark SQL.
The Spark SQL interface is very simple, and by using it to give big data a structure, we can tackle learning problems effectively. Spark also has great support for clustering and dimensionality-reduction algorithms, which become useful once we take away labels and move into unsupervised learning territory.
Let's take a look at the code that we will be using in our Jupyter Notebook. To maintain consistency, we will be using the same KDD cup data:
- We will first load the text file into a raw_data variable using the SparkContext's textFile method, as follows:
raw_data = sc.textFile("./kddcup.data.gz")
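Before giving the data a structure, it helps to see what a raw record looks like. Each line of the KDD Cup file is one comma-separated interaction record; the sample line below is an illustrative assumption, truncated from the real 42-field layout:

```python
# One raw KDD Cup interaction record (illustrative, truncated):
# duration, protocol_type, service, flag, src_bytes, dst_bytes, ..., label
sample_line = "0,tcp,http,SF,181,5450,normal."

# Splitting on commas yields the individual fields, all as strings.
fields = sample_line.split(",")
print(fields[1])       # the protocol_type field: 'tcp'
print(int(fields[4]))  # numeric fields must be cast from string to int
```

Because every field arrives as a string, any numeric column (such as duration or src_bytes) has to be cast explicitly before it can be used in computations or SQL comparisons.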
- What's new here is that we are importing two new packages from pyspark.sql:
- Row
- SQLContext
- The following code shows us how to import these packages:
from pyspark.sql import Row, SQLContext