Mastering Python for Finance


A word count program in Hadoop


Perhaps the simplest way to get started with programming for Hadoop is a word count program run on a fairly large electronic book. The map program reads each line of the text, splits it into words on spaces or tabs, and emits a key-value pair for every word, where the key is the word and the value is a default count of 1. The reduce program reads all the key-value pairs from the map program and sums the counts of identical words. Hadoop then produces an output file that lists each word in the book together with the number of times it appears.
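To make the map and reduce roles concrete, here is a minimal sketch that simulates both steps in memory on a couple of sample lines. It does not use any Hadoop API; the function names `map_words` and `reduce_counts` and the sample text are assumptions made for illustration. In a real Hadoop job, the framework itself sorts and groups the key-value pairs between the two steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical helper): emit (word, 1) for every word."""
    for line in lines:
        # str.split() with no argument splits on any run of whitespace,
        # which covers both spaces and tabs.
        for word in line.split():
            yield word, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step (hypothetical helper): sum the counts of identical words."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Sample input invented for this sketch, standing in for lines of the book.
sample_lines = ["the quick brown fox", "jumps over\tthe lazy dog"]

word_counts = reduce_counts(map_words(sample_lines))

# Print one "word<TAB>count" line per distinct word, sorted by word.
for word, total in sorted(word_counts.items()):
    print(f"{word}\t{total}")
```

Keeping the map step as a generator means the pairs are produced one at a time, which loosely mirrors how a mapper streams its output instead of holding the whole book in memory.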

Downloading sample data

Project Gutenberg hosts over 100,000 free e-books in HTML, EPUB, Kindle, and plain-text UTF-8 formats. As a sample e-book for testing, let's use Ulysses by James Joyce. The plain-text UTF-8 file is available at http://www.gutenberg.org/ebooks/4300.txt.utf-8. Using Firefox or any other web browser available in the CentOS virtual machine, you can download the file from this URL and save it as pg4300...