Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By : Jacob Perkins
Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Overview of this book

Table of Contents (17 chapters)
Python 3 Text Processing with NLTK 3 Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Penn Treebank Part-of-speech Tags
Index

Creating a MongoDB-backed corpus reader


All the corpus readers we've dealt with so far have been file-based. That is in part due to the design of the CorpusReader base class, and also the assumption that most corpus data will be in text files. However, sometimes you'll have a bunch of data stored in a database that you want to access and use just like a text file corpus. In this recipe, we'll cover the case where you have documents in MongoDB, and you want to use a particular field of each document as your block of text.

Getting ready

MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. The installation and setup of MongoDB is outside the scope of this book, but you can find instructions at http://docs.mongodb.org/manual/.

You'll also need to install PyMongo, a Python driver for MongoDB. You should be able to do this with either easy_install or pip, by typing sudo easy_install pymongo or sudo pip install pymongo.

The following code...