Now that we have seen some of the functionality, let's explore further. We can use a similar script to count the word occurrences in a file, as follows:
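import pyspark

# Reuse the existing SparkContext if the notebook already created one
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the file as a collection of lines
text_file = sc.textFile("Spark File Words.ipynb")

# Split lines into words, emit (word, 1) for each, then sum the 1s per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Bring the results back to the driver and print each (word, count) pair
for x in counts.collect():
    print(x)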
The script begins with the same preamble as before. Then we load the text file as an RDD of lines (Spark reads the data lazily, once an action such as collect() runs).
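To see what the chained calls in the script do to the loaded lines, it can help to imitate them in plain Python, without Spark. The following is only a rough sketch: the sample `lines` list and the helper names `flat_map` and `reduce_by_key` are made up for illustration and are not part of PySpark, and everything runs on a single machine rather than being distributed.

```python
# Hypothetical sample input standing in for the lines of the text file.
lines = ["the quick brown fox", "the lazy dog", "the end"]


def flat_map(func, items):
    """Apply func to each item and flatten the results (imitates RDD.flatMap)."""
    return [result for item in items for result in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of pairs that share a key (imitates RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Step 1: split every line into words, giving one flat list of words.
words = flat_map(lambda line: line.split(" "), lines)

# Step 2: create one (word, 1) record per occurrence, repeats included.
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

for word, count in sorted(counts.items()):
    print(word, count)
```

Note that `pairs` holds a separate record for every occurrence, so "the" appears in it three times with a count of 1 each; the totals only exist after the final reduce step.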
Once the file is loaded, we split each line into words and use a lambda function to tick off each occurrence of a word. The code actually creates a new record for each word occurrence: when a word appears in the stream, a record with a count of 1 is added for that word, and every further appearance of the word adds another record with the same count of 1. The idea is that this process could be split over multiple processors, where each...