Actions, in contrast to transformations, execute the scheduled task on the dataset: once you have finished transforming your data, you run an action to execute those transformations. An action might involve no transformations at all (for example, .take(n) will just return n records from an RDD even if you did not apply any transformations to it), or it might execute the whole chain of transformations.
The .take(...) method is arguably the most useful action (and one of the most used, much like the .map(...) method among transformations). It is preferred to .collect(...) because it returns only the first n rows, reading only as many data partitions as it needs (often a single one), whereas .collect(...) returns the whole RDD to the driver. This is especially important when you deal with large datasets:
data_first = data_from_file_conv.take(1)
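To see why this matters for large datasets, the following is a minimal plain-Python sketch of the idea, not Spark's actual implementation. It models an RDD as a list of partitions; the helper names take and collect, the sample data, and the partition counting are all assumptions made for illustration.

```python
from typing import List, Tuple


def take(partitions: List[List[int]], n: int) -> Tuple[List[int], int]:
    """Return the first n records and the number of partitions read."""
    taken: List[int] = []
    partitions_read = 0
    for part in partitions:
        if len(taken) >= n:
            break  # enough records already, leave the rest untouched
        partitions_read += 1
        taken.extend(part[: n - len(taken)])
    return taken, partitions_read


def collect(partitions: List[List[int]]) -> Tuple[List[int], int]:
    """Return every record; all partitions have to be read."""
    records = [record for part in partitions for record in part]
    return records, len(partitions)


# Hypothetical dataset split into three partitions
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

first, read_by_take = take(partitions, 1)
everything, read_by_collect = collect(partitions)

print(first, read_by_take)            # one record, one partition read
print(len(everything), read_by_collect)  # all records, all partitions read
```

In this toy model, asking for one record touches only the first partition, while collecting touches every partition and materializes every record in one place.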
If you want somewhat randomized records, you can use .takeSample(...) instead, which takes three arguments: the first specifies whether the sampling should be with replacement, the second specifies the number of records to return, and the third is a seed to the pseudo-random numbers...