Overview of RDD actions
As noted in preceding sections, there are two types of Apache Spark RDD operations: transformations and actions. An action returns a value to the driver after running a computation on the dataset, typically on the workers. In the preceding recipes, the take() and count() RDD operations are examples of actions.
Getting ready
This recipe reads a tab-delimited (or comma-delimited) file, so please ensure that you have a text (or CSV) file available. For your convenience, you can download the airport-codes-na.txt and departuredelays.csv files from http://bit.ly/2nroHbh. Ensure your local Spark cluster can access this file (~/data/flights/airport-codes-na.txt).
Note
If you are running Databricks, the same file is already included in the /databricks-datasets folder; the command is:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda line: line.split("\t"))
Many of the transformations in the next section will use the RDDs airports...