Now that we have worked through exercises to get you started with analytics, let's review some best practices for using Spark. While Spark delivers significant performance improvements over Hadoop MapReduce, we need to be aware of a few best practices to fully realize the value that Spark affords us:
- `collect` fetches all the elements of the dataset into the driver's memory. Unless you have validated that the dataset fits in memory, it is better to use `take(n)` so that you control how much data is returned (see the first sketch after this list).
- `groupByKey` is not very efficient, as it shuffles every key-value pair across the cluster; use `reduceByKey` instead, which combines values within each partition before the shuffle and so reduces the amount of data moved around (second sketch below).
- `filter` can be used as a pre-processing step to clean up the dataset by dropping bad-quality records (third sketch below).
- `map` can be used as a pre-processing step to impute values for bad or missing data (final sketch below).
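To make the first point concrete, here is a minimal PySpark sketch, assuming a local `SparkSession`; the RDD contents and application name are illustrative, not from the exercises:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("take-vs-collect").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))

# collect() would pull all one million elements into the driver's memory:
# all_rows = rdd.collect()   # risky unless the dataset is known to be small

# take(n) returns only the first n elements, keeping driver memory bounded.
preview = rdd.take(5)
print(preview)  # [0, 1, 2, 3, 4]
```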
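The second sketch contrasts `groupByKey` with `reduceByKey` on a word-count-style aggregation; the sample pairs are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)])

# groupByKey shuffles every (key, value) pair across the network before
# the values are summed on the reducer side:
grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first, so far less
# data crosses the network during the shuffle:
reduced = pairs.reduceByKey(lambda x, y: x + y)

print(sorted(reduced.collect()))  # [('a', 3), ('b', 2)] -- tiny result, safe to collect
```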
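The third sketch shows `filter` as a cleanup step; the CSV-like rows and the validity rule are assumptions made for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-cleanup").getOrCreate()
sc = spark.sparkContext

rows = sc.parallelize([
    "2024-01-01,42.0",
    "2024-01-02,",        # missing measurement
    "bad-record",         # malformed line
    "2024-01-03,17.5",
])

def is_valid(line):
    """Keep only rows with a date field and a parseable numeric value."""
    parts = line.split(",")
    if len(parts) != 2 or not parts[1]:
        return False
    try:
        float(parts[1])
        return True
    except ValueError:
        return False

clean = rows.filter(is_valid)
print(clean.collect())  # ['2024-01-01,42.0', '2024-01-03,17.5']
```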
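Finally, a sketch of `map` for imputation; the record layout and the fill value are hypothetical (in practice a mean or median of the column is often a better default):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-impute").getOrCreate()
sc = spark.sparkContext

rows = sc.parallelize([
    ("2024-01-01", 42.0),
    ("2024-01-02", None),   # missing measurement
    ("2024-01-03", 17.5),
])

DEFAULT = 0.0  # hypothetical fill value

# map transforms every record, replacing None with the default value
# instead of dropping the row as filter would.
imputed = rows.map(lambda r: (r[0], r[1] if r[1] is not None else DEFAULT))
print(imputed.collect())
# [('2024-01-01', 42.0), ('2024-01-02', 0.0), ('2024-01-03', 17.5)]
```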