We can run an analysis on large text streams, such as news articles, to attempt to glean important themes. Here we are pulling out bigrams (combinations of two words that appear in sequence) from the article.
For this example, I am using the text of an online article from The Atlantic titled "The World Might Be Better Off Without College for Everyone", available at https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/.
I am using this script:
import pyspark
if not 'sc' in globals():
    sc = pyspark.SparkContext()

# read the article, rejoin each partition into one string,
# then split on periods to get individual sentences
sentences = sc.textFile('B09656_09_article.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# split each sentence into words and emit every adjacent
# word pair as a ((word1, word2), 1) tuple
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x)-1)])
print(bigrams.count(), "bigrams")

# sum the counts per bigram, swap to (count, bigram),
# and sort by count in descending order
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x+y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
print(frequent_bigrams.take(10))
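The same counting logic can be sketched in plain Python, which may help clarify what the Spark pipeline is doing before it is distributed. This is a minimal illustration using a hypothetical sample string rather than the article text; `collections.Counter` plays the role of `reduceByKey`, and `most_common` plays the role of the swap-and-sort step.

```python
from collections import Counter

# Hypothetical sample text standing in for the article contents
text = "college is good. college is expensive. is college worth it"

# Split into sentences on periods, as the Spark script does
sentences = text.split(".")

# Count every adjacent word pair within each sentence
bigrams = Counter()
for sentence in sentences:
    words = sentence.split()
    for i in range(len(words) - 1):
        bigrams[(words[i], words[i + 1])] += 1

# Most frequent bigrams first, mirroring reduceByKey + sortByKey(False)
top = bigrams.most_common(3)
print(top)
```

Because bigrams never span a sentence boundary, the period split matters: "expensive. is" is not counted as a pair, just as in the Spark version.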