Now that we have a word count, the more interesting use is to sort them by occurrence to determine the highest usage.
We can slightly modify the previous script to produce a sorted list, as follows:
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")

# Count the words, then swap each (word, count) pair to (count, word)
# so we can sort by occurrence, highest first
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda pair: (pair[1], pair[0])) \
    .sortByKey(False)

for x in sorted_counts.collect():
    print(x)
This produces output as follows:
The list continues for every word found. Notice that the counts appear in descending order of occurrence, and that words with the same count are grouped together. The word breaks themselves are not very good, however: we split each line on single spaces, so punctuation and notebook markup end up attached to the words being counted.
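One way to improve the word breaks is to tokenize with a regular expression instead of splitting on spaces. The following is a minimal pure-Python sketch (no Spark required) of a hypothetical tokenize function that could be passed to flatMap in place of the lambda; the sample text and the helper name are illustrative assumptions, not part of the original script:

```python
import re
from collections import Counter

def tokenize(line):
    # Lowercase the line and keep only runs of letters (and apostrophes),
    # so punctuation and markup are not counted as part of words
    return re.findall(r"[a-z']+", line.lower())

# Illustrative sample text standing in for a line of the notebook file
sample = 'The "word count" example, counts words -- words like these!'

counts = Counter(tokenize(sample))

# Sort by occurrence, highest first; break ties alphabetically
for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, count)
```

In the Spark version, the same effect comes from replacing `lambda line: line.split(" ")` with `tokenize`; the rest of the pipeline is unchanged.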