Python is a popular choice among data scientists because of its high-level syntax and extensive ecosystem of packages, so the Spark developers provide the PySpark API for working with RDDs in Python. IPython Notebook is an essential tool that lets data scientists present scientific and theoretical work interactively, integrating both text and Python code.
This recipe shows how to configure IPython with PySpark and how to connect the IPython shell to PySpark.
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux distribution). Python comes preinstalled. The python --version command reports the installed Python version. If the version is 2.6.x, upgrade to Python 2.7 as follows:
sudo apt-get install python2.7
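The same version check can also be done from inside Python. The following is a minimal sketch, not part of the original recipe: the helper name needs_upgrade and its minimum parameter are hypothetical, and it assumes only the standard sys module.

```python
import sys


def needs_upgrade(version_info=sys.version_info, minimum=(2, 7)):
    """Return True when the given interpreter version is older than `minimum`.

    `needs_upgrade` is a hypothetical helper for illustration only;
    it compares the (major, minor) pair against the required minimum.
    """
    return tuple(version_info[:2]) < minimum


# Report the running interpreter and whether the upgrade step applies.
print("Running Python %d.%d.%d" % tuple(sys.version_info[:3]))
if needs_upgrade():
    print("Older than 2.7: run 'sudo apt-get install python2.7'")
else:
    print("Python 2.7 or newer is available")
```

Run the script with the same python executable that you intend to use with PySpark, since a machine can have several interpreters installed side by side.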