Creating RDDs
For this recipe, we will start by creating an RDD from data generated within PySpark. To create RDDs in Apache Spark, you will first need to install Spark, as shown in the previous chapter. You can use the PySpark shell and/or a Jupyter notebook to run these code samples.
Getting ready
We require a working installation of Spark, which means you will have followed the steps outlined in the previous chapter. As a reminder, to start the PySpark shell for your local Spark cluster, run this command:
./bin/pyspark --master local[n]
Where n is the number of cores.
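For example, to start the shell with four worker cores on the local machine:
./bin/pyspark --master local[4]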
How to do it...
To quickly create an RDD, run PySpark on your machine via the bash terminal, or run the same query in a Jupyter notebook. There are two ways to create an RDD in PySpark: you can either use the parallelize() method on a collection (a list or an array of some elements), or reference a file (or files) located locally or in an external source, as noted in subsequent recipes.
The following code...
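A minimal sketch of both approaches follows, assuming the sc SparkContext that the PySpark shell provides automatically; the file path /tmp/data.txt used in the second method is a hypothetical placeholder:

# Method 1: parallelize an in-memory collection into an RDD
myRDD = sc.parallelize([('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)])
print(myRDD.take(5))

# Method 2: reference an external file (hypothetical path)
fileRDD = sc.textFile('/tmp/data.txt')
print(fileRDD.take(5))

Note that take(n) returns only the first n elements of the RDD, which makes it a convenient way to confirm the RDD's contents without pulling the entire dataset back to the driver.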