The datasets we have used so far stored each data item on a single line, which let us parse them with Hadoop's default input support. However, some datasets have more complex formats, where a single data item may span multiple lines. In this recipe, we will analyze the mailing list archives of the Tomcat developers. In these archives, each e-mail spans multiple lines of the archive file. Therefore, we will write a custom Hadoop InputFormat to process the e-mail archive.
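To make the multi-line problem concrete, the following is a minimal sketch in plain Java (no Hadoop classes) of the record-splitting logic that a custom InputFormat's reader would need. It assumes an mbox-style archive in which every message starts with a line beginning with `From `; that delimiter rule, the class name `MultiLineRecordSketch`, and the method names are illustrative assumptions, not the recipe's actual code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class MultiLineRecordSketch {

    // Assumption: a new message begins at any line that starts with "From "
    // (mbox-style separator). A real archive may need a stricter pattern.
    static boolean isRecordStart(String line) {
        return line.startsWith("From ");
    }

    // Groups consecutive lines into one record per message.
    // Lines before the first separator are ignored.
    static List<String> splitRecords(BufferedReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        String line;
        while ((line = reader.readLine()) != null) {
            if (isRecordStart(line)) {
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data with two messages.
        String archive =
            "From alice@example.org Mon Jan 1 10:00:00 2012\n"
          + "Subject: Build fails\n"
          + "\n"
          + "The build fails on trunk.\n"
          + "From bob@example.org Mon Jan 1 11:00:00 2012\n"
          + "Subject: Re: Build fails\n"
          + "\n"
          + "Works for me.\n";
        List<String> records = splitRecords(new BufferedReader(new StringReader(archive)));
        System.out.println(records.size() + " records found");
    }
}
```

A line-oriented reader would hand each of these lines to the Map function separately; grouping them first is what lets a single Map call see one whole e-mail.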
This recipe parses the e-mail list archives and finds, for each e-mail thread, the owner (the person who started the thread) and the number of replies the thread received.
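The computation can be pictured with a small plain-Java sketch that imitates the map, group, and reduce steps in memory. It rests on several assumptions that are not taken from the recipe: replies are recognized by stripping leading `Re:` prefixes from the subject, the owner is the sender of the earliest message in a thread, and the reply count is the number of messages minus one. The `Mail` class, the `#` separator, and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThreadSummarySketch {

    // Hypothetical parsed e-mail: subject, sender address, and send time in milliseconds.
    static final class Mail {
        final String subject;
        final String sender;
        final long timestamp;

        Mail(String subject, String sender, long timestamp) {
            this.subject = subject;
            this.sender = sender;
            this.timestamp = timestamp;
        }
    }

    // Assumption: replies carry one or more leading "Re:" prefixes (any case).
    static String normalizeSubject(String subject) {
        String s = subject.trim();
        while (s.regionMatches(true, 0, "Re:", 0, 3)) {
            s = s.substring(3).trim();
        }
        return s;
    }

    // "Map" and "shuffle": key = normalized subject, value = sender#timestamp.
    static Map<String, List<String>> mapAndGroup(List<Mail> mails) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Mail m : mails) {
            grouped.computeIfAbsent(normalizeSubject(m.subject), k -> new ArrayList<>())
                   .add(m.sender + "#" + m.timestamp);
        }
        return grouped;
    }

    // "Reduce": the earliest sender is the owner; every other message counts as a reply.
    static String reduce(List<String> values) {
        String owner = null;
        long earliest = Long.MAX_VALUE;
        for (String v : values) {
            int sep = v.lastIndexOf('#');
            long ts = Long.parseLong(v.substring(sep + 1));
            if (ts < earliest) {
                earliest = ts;
                owner = v.substring(0, sep);
            }
        }
        return owner + "," + (values.size() - 1);
    }

    static Map<String, String> summarize(List<Mail> mails) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : mapAndGroup(mails).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Mail> mails = new ArrayList<>();
        mails.add(new Mail("Build fails", "alice@example.org", 100L));
        mails.add(new Mail("Re: Build fails", "bob@example.org", 200L));
        mails.add(new Mail("Release plan", "dave@example.org", 50L));
        System.out.println(summarize(mails));
    }
}
```

In a real job, the grouping done here by `mapAndGroup` is performed by Hadoop's shuffle phase, so only the per-message emit and the per-thread reduction would be written by hand.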
The following figure shows the execution summary of this computation. The Map function emits the subject of the mail as the key, and the sender's e-mail address combined with the date as the value. Then, Hadoop groups the data by the e-mail subject and sends all the data related to that thread to the same...