In this example, we'll create a full text index using Nucular for our large set of data. We'll use the same comp.lang.python messages as used previously, which are available via the Packt Publishing FTP site. We'll index only one month at a time in order to keep our examples manageable. In aggregate, the collection gives us over 85,000 files to work with, totaling up to 315 MB of raw text data.
In creating a full text index, we won't separate each message out into its component parts. All of the text for each message will become a single attribute within each Nucular entry.
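To make the single-attribute idea concrete, here is a minimal sketch of how one message file could be turned into a one-attribute dictionary before it is handed to the index. It deliberately leaves Nucular out, and the attribute name full_text and the helpers message_to_entry and entries_for_directory are our own illustrative assumptions, not names from the Nucular API or from the listing that follows.

```python
import io
import os


def message_to_entry(path, attribute='full_text'):
    """Return a one-attribute dictionary holding the whole message.

    The attribute name 'full_text' is a hypothetical choice.
    """
    # Read headers and body together; nothing is parsed or split out.
    with io.open(path, encoding='utf-8', errors='replace') as handle:
        text = handle.read()
    return {attribute: text}


def entries_for_directory(where):
    """Yield (identifier, entry) pairs for each regular file in a directory."""
    for name in sorted(os.listdir(where)):
        full_path = os.path.join(where, name)
        if os.path.isfile(full_path):
            # Assumption: the file name doubles as the entry identifier.
            yield name, message_to_entry(full_path)
```

Because every entry carries exactly one attribute, a search over that attribute covers headers and body alike. The trade-off is that we can't later restrict a query to, say, only the subject line.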
Create a new file and name it clp_index.py. We'll use this to generate our index. Enter the following code:

    import os
    from optparse import OptionParser
    from nucular import Nucular

    def index_contents(session, where, persist_every=100):
        """Index a directory at a time."""
        for c, i in enumerate(os.listdir(where)):
            full_path = os.path.join(where, i)
            print 'indexing...