Spark is an Apache project that provides an open source framework geared toward cluster computing. For our purposes, it exposes an API, callable from languages such as Python and Scala, that can be used to access Hadoop data sets.
We install the Spark engine and run a Spark script in Jupyter to show it working, as follows.
Generally, installing Spark involves two steps:
- Installing Spark (for your environment)
- Connecting Spark to your environment (whether standalone or clustered)
Spark installation is environment specific. I've included the steps to install Spark (for use with Jupyter) in a Windows environment here; other environments have their own instructions.
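Once Spark is unpacked on the machine, the notebook-side connection step usually amounts to telling Python where the installation lives. The following is a minimal sketch of that step on Windows; the install paths are hypothetical placeholders for wherever you unpacked Spark and the `winutils.exe` Hadoop shim, and it assumes the `findspark` helper package has been installed (for example, with `pip install findspark`).

```python
# Sketch: point a Jupyter notebook at a local Spark install on Windows.
# The paths below are hypothetical -- substitute your own locations.
import os

os.environ["SPARK_HOME"] = r"C:\spark\spark-3.5.0-bin-hadoop3"  # hypothetical Spark folder
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # folder containing bin\winutils.exe

# findspark adds SPARK_HOME's Python libraries to sys.path so that
# `import pyspark` works inside the notebook.
import findspark
findspark.init()

import pyspark  # should now import without error
print(pyspark.__version__)
```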
Similarly, Spark relies on a base language to work from; this can be Scala or Python. Python comes automatically with our Jupyter installation, so we will rely on Python as the basis. In other words, we will code a Python notebook in which the Python statements drive the Spark engine.
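To make that concrete, here is a minimal "smoke test" you might run in the first cell of such a notebook to confirm the engine is wired up. It assumes `pyspark` is importable (for example, via the `findspark` setup sketched above); the application name is arbitrary.

```python
# Minimal PySpark smoke test for a Jupyter notebook.
from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session; "local[*]" uses all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("JupyterSmokeTest")  # arbitrary name
         .getOrCreate())

# Distribute a small Python collection and run a trivial computation on it.
rdd = spark.sparkContext.parallelize(range(1, 101))
print(rdd.sum())  # 5050

spark.stop()
```

If the cell prints 5050, Python is successfully handing work to the Spark engine, and we can move on to real data sets.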