Okay, let's do one more round of improvements on our word-count script. We need to sort the word-count results by something useful. Instead of just having a random list of words associated with how many times they appear, what we want is to see the least used words at the beginning of our list and the most used words at the end. This should give us some actually interesting information to look at. To do this, we're going to need to manipulate our results a little more directly; we can't just cheat and use countByValue and call it done.
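To make the goal concrete, here's a minimal plain-Python sketch of the ordering we're after. No Spark is involved here; the sample text and the variable names are made up purely for illustration, and Counter is just standing in for the kind of local dictionary-like object that countByValue hands back.

```python
from collections import Counter

# Hypothetical sample text, invented for this illustration only.
text = "the quick brown fox jumps over the lazy dog the fox"

# Counter builds a local mapping of word -> count. This is only an
# analogue of the Python object countByValue returns, not Spark itself.
word_counts = Counter(text.split())

# Sort by the count (the second item of each pair), ascending, so the
# least used words come first and the most used words land at the end.
sorted_counts = sorted(word_counts.items(), key=lambda pair: pair[1])

for word, count in sorted_counts:
    print(word, count)
```

This works fine for a small local object, but it all happens in plain Python on one machine, which is why we want the counts to live in an RDD instead.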
So the first thing we're going to do is actually implement what countByValue does by hand, the hard way. This way we can play with the results more directly and stick them in an RDD instead of just getting a Python object that we need to deal with at that point. The way we do that is we take our map of words (words.map) and we use a mapper that...