Although a single text file is not a big data source, we will first show how to get a word count from one. Then we'll find a larger data file to work with.
We can use this script to see the word counts for a file:
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
for x in counts.collect():
    print(x)
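To see what each stage of the pipeline produces, the same flatMap/map/reduceByKey sequence can be sketched in plain Python, without Spark. The two-line sample input here is a made-up illustration, not the contents of the notebook file used above:

```python
from collections import defaultdict

# Hypothetical sample input standing in for the lines of a text file
lines = ["to be or not", "to be"]

# flatMap: split each line into words and flatten into a single list
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

for item in counts.items():
    print(item)
```

This prints one (word, count) tuple per distinct word, e.g. ('to', 2) and ('be', 2), which mirrors the shape of the output Spark returns from collect().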
When we run this in Jupyter, we see something akin to this display:
The display continues with a (word, count) pair for every distinct word detected in the source file.