Munging textual data
In this section, we explore data munging techniques for typical analysis situations. Many text-based analysis tasks require computing word counts, removing stop words, stemming, and so on. We will also explore how you can process multiple files, one at a time, from HDFS directories.
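To make these operations concrete before bringing in Spark, the following is a minimal sketch in plain Scala of tokenizing, stop-word removal, and word counting. The object name `TextMunging`, its method names, and the tiny stop-word list are assumptions made for illustration only; stemming is left out because it normally relies on a dedicated library.

```scala
// Illustrative sketch: names and the stop-word list are hypothetical.
object TextMunging {
  // A deliberately small sample list; real analyses use a much larger one.
  val stopWords: Set[String] = Set("a", "an", "and", "is", "of", "the", "to")

  // Lowercase the text and split on runs of non-letter characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Drop every token that appears in the stop-word list.
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // Group identical tokens and count the size of each group.
  def wordCounts(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (word, group) => (word, group.size) }
}
```

The same three steps map naturally onto RDD transformations (`flatMap`, `filter`, and `reduceByKey`) when the text is too large for a single machine.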
First, we import all the classes that will be used in this section:
Processing multiple input data files
In the next few steps, we initialize a set of variables for defining the directory containing the input files, and an empty RDD. We also create a list of filenames from the input HDFS directory. In the following example, we will work with files contained in a single directory; however, the techniques can easily be extended across all 20 newsgroup sub-directories.
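One way to set this up is sketched below, assuming an existing `SparkContext` and using the Hadoop `FileSystem` API to list the directory. The object name `InputFiles`, its method names, and the choice of `RDD[String]` as the element type of the empty RDD are assumptions for illustration, not necessarily what the full listing uses.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative sketch: object and method names are hypothetical.
object InputFiles {
  // Return the full paths of the regular files directly under `dir`.
  def listFiles(sc: SparkContext, dir: String): Seq[String] = {
    val path = new Path(dir)
    // Resolve the file system from the path's scheme (hdfs://, file://, ...).
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    fs.listStatus(path).filter(_.isFile).map(_.getPath.toString).toSeq
  }

  // An empty RDD that later steps can build on (element type assumed).
  def emptyLines(sc: SparkContext): RDD[String] = sc.emptyRDD[String]
}
```

To cover all the newsgroup sub-directories, the same `listFiles` call could be applied to each sub-directory in turn and the resulting lists concatenated.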
Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
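One possible shape for such a function is sketched here. It assumes an existing `SparkContext` and a list of file paths; the object name `PerFileWordCount`, the function name `countWords`, the tokenizing rule, and the decision to store `(word, count)` pairs in the buffer are all assumptions made for illustration.

```scala
import org.apache.spark.SparkContext
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch: names and the buffer's element type are hypothetical.
object PerFileWordCount {
  // Count the words of each file in turn and append the pairs to one buffer.
  def countWords(sc: SparkContext, files: Seq[String]): ArrayBuffer[(String, Int)] = {
    val results = ArrayBuffer.empty[(String, Int)]
    for (file <- files) {
      // Show each file name as it is picked up for processing.
      println(s"Processing: $file")
      val counts = sc.textFile(file)
        .flatMap(_.toLowerCase.split("\\W+")) // split lines on non-word characters
        .filter(_.nonEmpty)                   // discard empty tokens
        .map(word => (word, 1))
        .reduceByKey(_ + _)                   // sum the counts per word
        .collect()                            // bring this file's counts to the driver
      results ++= counts
    }
    results
  }
}
```

Because `collect()` returns every pair to the driver, this approach suits small files such as individual newsgroup posts; for large inputs it is usually better to keep the counts in an RDD.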
We have included a print statement to display the file names as they are picked up for processing, as follows:
We...