We can use MapReduce in another example, where we get the word counts from a file. It is a standard problem, but we let MapReduce do most of the heavy lifting. As input we can use the source of this example itself, the notebook file Spark File Words.ipynb. A script similar to the following counts the word occurrences in a file:
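import pyspark

# Reuse the SparkContext if the notebook already created one
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the file as a distributed collection of lines
text_file = sc.textFile("Spark File Words.ipynb")

# Split lines into words, emit (word, 1) pairs, then sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Bring the results back to the driver and print each (word, count) pair
for x in counts.collect():
    print(x)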
First we import pyspark and obtain a SparkContext, reusing sc if one already exists. Then we load the text file into memory.
It is assumed to be massive, with its contents distributed over many handlers.
Once the file is loaded we split each line into words, and then use a lambda
function to tick off each occurrence of a word. The code is truly creating a new record...
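To make the intermediate records visible, the same three steps can be sketched in plain Python on a small in-memory list. This is only an illustration of the idea, not how Spark implements it: the helpers flat_map and reduce_by_key and the sample lines are hypothetical stand-ins for Spark's flatMap and reduceByKey, and everything runs on one machine rather than across handlers.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values of all pairs that share a key, using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical sample input standing in for the lines of the text file
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Step 1: split each line into words
words = flat_map(lambda line: line.split(" "), lines)

# Step 2: create one (word, 1) record per occurrence
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word
counts = reduce_by_key(lambda a, b: a + b, pairs)

for word, count in sorted(counts.items()):
    print(word, count)
```

Each occurrence of a word becomes its own (word, 1) record in step 2, and step 3 only ever adds two numbers at a time, which is what lets the real reduceByKey combine partial results from different partitions in any order.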