All the corpus readers we've dealt with so far have been file-based. That is in part due to the design of the CorpusReader base class, and also the assumption that most corpus data will be in text files. However, sometimes you'll have a bunch of data stored in a database that you want to access and use just like a text file corpus. In this recipe, we'll cover the case where you have documents in MongoDB, and you want to use a particular field of each document as your block of text.
MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. The installation and setup of MongoDB is outside the scope of this book, but you can find instructions at http://docs.mongodb.org/manual/.
You'll also need to install PyMongo, a Python driver for MongoDB. You should be able to do this with either easy_install or pip, by typing sudo easy_install pymongo or sudo pip install pymongo.
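To make the goal concrete, here is a minimal sketch of pulling one field out of each document with PyMongo. It is an illustration only, not this recipe's reader: the function names iter_field_text and connect_collection, the database name "test", the collection name "corpus", and the field name "text" are all assumptions made for the example, and the connection defaults assume a MongoDB server running locally on the standard port.

```python
def iter_field_text(collection, field, query=None):
    """Yield the value stored under `field` for each matching document.

    `collection` is expected to behave like a PyMongo collection, that is,
    to offer a find(filter, projection) method returning dict-like documents.
    """
    # The projection {field: 1} asks MongoDB to return only that field
    # (plus _id), so large documents are not transferred in full.
    for doc in collection.find(query or {}, {field: 1}):
        text = doc.get(field)
        if text:  # skip documents where the field is missing or empty
            yield text


def connect_collection(host="localhost", port=27017,
                       db="test", collection="corpus"):
    """Return a PyMongo collection object (hypothetical names and defaults)."""
    # Imported here so the helper above can be used without PyMongo installed.
    from pymongo import MongoClient
    return MongoClient(host, port)[db][collection]
```

With a server running, list(iter_field_text(connect_collection(), "text")) would give you one string per document that has a text field, which is the kind of block of text a corpus reader can then tokenize.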
The following code...