Processing data with PySpark
Before processing data with PySpark, let's run one of the samples to show how Spark works. Then, we will skip the boilerplate in later examples and focus on data processing. The Jupyter notebook for the Pi Estimation example from the Spark website at http://spark.apache.org/examples.html is shown in the following screenshot:
- The first cell imports findspark and runs the init() method. This was explained in the preceding section as the preferred way to include PySpark in Jupyter notebooks. The code is as follows:

import findspark
findspark.init()
- The next cell imports SparkSession. It then creates the session by passing the URL of the head node of the Spark cluster. You can get the URL from the Spark web UI...