In addition to the Python-based systems we examined throughout the book, there are a number of other high-quality systems available. While not pure Python, many of them provide a means to access data or communicate with a server component in a language-agnostic manner. We'll take a look at some of the more common systems here.
The Apache Software Foundation's Lucene project, located at http://lucene.apache.org, is the de facto standard in open source indexing and searching. We touched on the topic briefly in Chapter 11, Searching and Indexing, but didn't go into much detail.
The core Lucene project is a Java-based collection that provides file indexing and searching capabilities, much like the Nucular system we looked at. There is a set of Java libraries available for use and command-line tools that may be used without much Java knowledge.
The Lucene project also ships a search server named Solr. Where core Lucene is a set of libraries, Solr is a full-featured search server that runs on top of Tomcat (or another compliant application container). Solr exposes a rich REST-like XML/JSON API and allows you to index and query against it using any programming language that supports such interaction (Python, of course, is included).
Some of the highlights include:
Rich document handling, such as Microsoft Word or rich text documents.
Full text search with hit highlighting, dynamic clustering, and support for database integration.
Scalability through replication to collections of other Solr servers in order to horizontally disperse load.
Spelling suggestions, support for "more documents like this", field sorting, automatic suggestions, and search results clustering using Carrot2. More information about Carrot2 is available at http://search.carrot2.org/.
A ready-to-use administration interface that includes information such as logging, cache statistics, and replication details.
If you're about to embark upon a project that requires highly scalable search functionality for a variety of different data types, Solr might save you quite a bit of work. The main page is available at http://lucene.apache.org/solr.
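As a rough illustration of the language-agnostic API described above, the following sketch queries Solr over HTTP using only the standard library. It assumes a Solr server whose select handler returns the usual JSON layout (a response object holding numFound and docs). The base URL, the core name books, and the helper names are hypothetical, and the exact URL path varies between Solr versions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, query, rows=10):
    """Return a Solr select URL that asks for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/select?{params}"


def parse_select_response(body):
    """Return (total hit count, list of matching documents) from a JSON body."""
    result = json.loads(body)["response"]
    return result["numFound"], result["docs"]


def search(base_url, query, rows=10):
    """Query a running Solr server; requires network access to that server."""
    with urlopen(build_select_url(base_url, query, rows)) as handle:
        return parse_select_response(handle.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical server and core name; adjust for your own installation.
    print(build_select_url("http://localhost:8983/solr/books", "title:python"))
```

Keeping URL construction and response parsing separate from the network call makes both pieces easy to check without a live server.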
Note
There is a Python version of the Lucene engine, named PyLucene. This, however, isn't a direct port of the libraries. Rather, it's a wrapper around the existing Java functionality. This may or may not be suitable for all Python deployments, so we chose not to cover it in this book.
One final note here is that if you're using Jython, the Java implementation of Python, you can access native Lucene libraries directly from within Python. You can read up on Jython at http://www.jython.org.
Bison is a parser generator that produces C-based parsing code from an annotated context-free grammar. Bison is compatible with YACC, so if you're already familiar with YACC, the migration shouldn't be terribly difficult.
Bison allows the developer to define a file, which contains a prologue, an epilogue, and a collection of Bison grammar rules. The general format of a Bison input file is as follows:
%{
Prologue
%}
Bison Parsing Declarations
%%
Grammar Rules
%%
Epilogue
As the output of a Bison run is a C source file, the Prologue is generally used for forward declarations and prototypes, and the Epilogue is used for additional functions that may be used in the processing. A Bison-generated parser must then be compiled and linked into a C application. GNU Bison documentation is available at http://www.gnu.org/software/bison/.
Note
There is also a Python Lex and Yacc implementation available at http://www.dabeaz.com/ply/. Its self-stated goal is to simply mimic the functionality of standard Lex and Yacc utilities.
Tika is another Apache Java project. The Tika utilities extract structured data from various document types. When processing non-plain-text file types, Lucene relies upon the Tika libraries to extract and normalize data for indexing. Tika is located on the Internet at http://tika.apache.org/.
This is quite a powerful package. In addition to text extraction, Tika supports EXIF data found in images, metadata from MP3 files, and extraction of information from FLV Flash videos. While not callable directly from CPython, Tika supplies command-line utilities that may be used programmatically via the subprocess module.
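A minimal sketch of that approach follows. It assumes a Java runtime on the PATH and a locally downloaded Tika application JAR. The JAR path and the helper names are hypothetical; --text and --metadata are options of the Tika command-line application.

```python
import subprocess

# Map a friendly mode name to the corresponding Tika command-line option.
TIKA_MODES = {"text": "--text", "metadata": "--metadata"}


def build_tika_command(jar_path, document, mode="text"):
    """Return the argument list used to run the Tika application JAR."""
    if mode not in TIKA_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return ["java", "-jar", jar_path, TIKA_MODES[mode], document]


def run_tika(jar_path, document, mode="text"):
    """Run Tika on a document and return whatever it writes to standard output."""
    completed = subprocess.run(
        build_tika_command(jar_path, document, mode),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical paths; point these at your own JAR and document.
    print(build_tika_command("tika-app.jar", "report.doc"))
```

Passing the command as a list rather than a single shell string avoids quoting problems when document names contain spaces.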