Haskell Data Analysis Cookbook

By: Nishant Shukla


Implementing MapReduce to count word frequencies


MapReduce is a framework for writing efficient parallel algorithms in a divide-and-conquer style. If a task can be split into smaller independent tasks, and the results of those tasks can be combined to form the final answer, then MapReduce is likely a good fit for the job.

In the following figure, we can see that a large list is split up, and the mapper functions work in parallel on each split. After all the mapping is complete, the second phase of the framework kicks in, reducing the various calculations into one final answer.

In this recipe, we will be counting word frequencies in a large corpus of text. Given many files of words, we will apply the MapReduce framework to find the word frequencies in parallel.
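The two phases described above can be sketched in a few lines of Haskell. This is a minimal illustration and not necessarily the code this recipe goes on to build: it assumes the `parallel` and `containers` packages are available, it works on in-memory strings rather than files, and the names `Counts`, `mapper`, `reducer`, and `wordFrequencies` are hypothetical, chosen only for this example.

```haskell
import Control.Parallel.Strategies (parMap, rseq)
import Data.Char (isAlpha, toLower)
import qualified Data.Map.Strict as Map

-- Word counts for one chunk of text (hypothetical type alias).
type Counts = Map.Map String Int

-- Map phase: count the words in a single chunk.
-- Assumption: text is lowercased and every non-letter acts as a separator.
mapper :: String -> Counts
mapper chunk = Map.fromListWith (+) [(w, 1) | w <- words cleaned]
  where
    cleaned = map (\c -> if isAlpha c then toLower c else ' ') chunk

-- Reduce phase: merge the per-chunk counts by adding frequencies.
reducer :: [Counts] -> Counts
reducer = Map.unionsWith (+)

-- Spark one mapper per chunk in parallel, then reduce the results.
-- rseq forces each result to weak head normal form; because a strict
-- Map is spine-strict, that is enough to do the counting inside the spark.
wordFrequencies :: [String] -> Counts
wordFrequencies = reducer . parMap rseq mapper

main :: IO ()
main = do
  -- Stand-ins for the contents of separate input files.
  let chunks = ["the quick brown fox", "the lazy dog", "The fox"]
  mapM_ print (Map.toList (wordFrequencies chunks))
```

To see any actual parallelism, the program would need to be compiled with GHC's `-threaded` flag and run with `+RTS -N`. With only a handful of tiny chunks, as here, the sparks are too small to give a speedup; the sketch is meant to show the shape of the computation, not its performance.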

Getting ready

Install the parallel package using cabal as follows:

$ cabal install parallel

Create multiple files with words. In this recipe, we download a huge text file and split it up using the UNIX split command as follows...