One of the tools we will be using is Apache Spark. Spark is an open source toolset for cluster computing. Although we will not be using a cluster here, Spark is typically run on a set of machines operating in parallel to analyze a big data set. An installation guide is available at https://www.dataquest.io/blog/pyspark-installation-guide. In particular, you will need to add two settings to your bash profile: SPARK_HOME and PYSPARK_SUBMIT_ARGS. SPARK_HOME is the directory where the software is installed, and PYSPARK_SUBMIT_ARGS sets the number of cores to use in the local cluster.
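As a rough sketch, the corresponding entries in your bash profile might look like the following; the installation path and the core count (local[4]) are assumptions you should adapt to your own setup:

# Assumed installation path -- adjust to wherever you placed the unpacked Spark directory
export SPARK_HOME=~/Applications/spark
export PATH=$SPARK_HOME/bin:$PATH

# Run PySpark against a local "cluster" using 4 cores (adjust as needed)
export PYSPARK_SUBMIT_ARGS="--master local[4] pyspark-shell"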
To install, we download the latest TGZ file from the Spark download page at https://spark.apache.org/downloads.html, unpack it, and move the unpacked directory to our Applications folder.
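In the shell, those unpack-and-move steps look roughly like this; the archive name and destination folder are placeholders, since they depend on the Spark version you downloaded and where you keep applications:

# Unpack the downloaded archive (the file name will vary with the version you chose)
tar -xzf spark-<version>-bin-hadoop<version>.tgz

# Move the unpacked directory into the Applications folder
mv spark-<version>-bin-hadoop<version> ~/Applications/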
Spark requires Scala to be available; we installed Scala in Chapter 7, Sharing and Converting Jupyter Notebooks.
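If you want to confirm that Scala is available on your path before continuing, a quick sanity check (not part of the official instructions) is:

# Print the installed Scala version
scala -version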
Open a command-line window to the Spark directory and run this...