Python 2.6 Text Processing: Beginner's Guide

By: Jeff McNeil

Overview of this book

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure: these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better: a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats, and at installing the supporting libraries and components so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML, HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally, you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.

Honorable mention


In addition to the Python-based systems we examined throughout the book, there are a number of other high-quality systems out there. While not pure Python, many of them provide a means to access data or communicate with a server component in a language-agnostic manner. We'll take a look at some of the more common systems here.

Lucene and Solr

We touched on the topic briefly in Chapter 11, Searching and Indexing, but didn't go into very much detail. The Apache Foundation's Lucene project, located at http://lucene.apache.org, is the de facto standard in open source indexing and searching.

The core Lucene project is a Java-based collection of libraries that provides file indexing and searching capabilities, much like the Nucular system we looked at. In addition to the Java libraries themselves, the project ships command-line tools that may be used without much Java knowledge.

The Lucene project also ships an indexing server named Solr. Where core Lucene is a set of libraries, Solr is a full-featured search server that runs on top of a Tomcat (or other compliant) application container. Solr exposes a rich REST-like XML/JSON API and allows you to index and query against it from any programming language that supports such interaction (Python, of course, included).

Some of the highlights include:

  • Rich document handling, such as Microsoft Word or rich text documents.

  • Full text search with hit highlighting, dynamic clustering, and support for database integration.

  • Scalability through replication to collections of other Solr servers, distributing the load horizontally.

  • Spelling suggestions, support for "more documents like this", field sorting, automatic suggestions, and search results clustering using Carrot2. More information about Carrot2 is available at http://search.carrot2.org/.

  • A ready-to-use administration interface that includes information such as logging, cache statistics, and replication details.

If you're about to embark upon a project that requires highly scalable search functionality for a variety of different data types, Solr might save you quite a bit of work. The main page is available at http://lucene.apache.org/solr.
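As a rough sketch of that kind of interaction, the snippet below builds a query URL for Solr's select handler. The server location is the assumed out-of-the-box default, and the helper name is our own; the resulting URL can be fetched with urllib2.urlopen (urllib.request in Python 3) and the wt=json response decoded with the standard json module.

```python
def build_solr_query(terms, rows=10, wt="json"):
    """Return a Solr select URL matching all of the given terms.

    Assumes a default Solr install listening on localhost:8983.
    """
    base = "http://localhost:8983/solr/select"
    # Join the terms into a URL-encoded "a AND b" query string;
    # '+' stands in for a space in the query portion of a URL.
    q = "+AND+".join(terms)
    return "%s?q=%s&rows=%d&wt=%s" % (base, q, rows, wt)

print(build_solr_query(["python", "text"]))
```

The wt parameter selects Solr's response writer, so the same request can return XML, JSON, or other formats depending on what your client prefers.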

Note

There is a Python version of the Lucene engine, named PyLucene. This, however, isn't a direct port of the libraries; rather, it's a wrapper around the existing Java functionality. This may or may not be suitable for all Python deployments, so we chose not to cover it in this book.

One final note here: if you're using Jython, the Python implementation that runs on the Java virtual machine, you can access the native Lucene libraries directly from within Python. You can read up on Jython at http://www.jython.org.

Generating C-based parsers with GNU Bison

Bison is a parser generator that produces C-based parser code from an annotated context-free grammar. Bison is compatible with YACC, so if you're familiar with that tool, the migration shouldn't be terribly difficult.

Bison has the developer define an input file, which contains a prologue, an epilogue, and a collection of Bison grammar rules. The general format of a Bison input file is as follows:

%{
  Prologue
%}

Bison Parsing Declarations

%%
Grammar Rules
%%

Epilogue

As the output of a Bison run is a C source file, the prologue is generally used for forward declarations and prototypes, while the epilogue holds additional functions that may be used in the processing. A Bison-generated parser must then be compiled and linked into a C application. GNU Bison documentation is available at http://www.gnu.org/software/bison/.
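To get a feel for what a grammar rule describes, here is a tiny hand-written Python equivalent of a rule such as "expr : expr '+' NUMBER | NUMBER". This is our own toy illustration of the left-recursive sum rule, not Bison output:

```python
import re

# A NUMBER token is a run of digits; the only operator token is '+'.
TOKEN = re.compile(r"\d+|[+]")

def parse_sum(text):
    """Evaluate input matching: expr : expr '+' NUMBER | NUMBER."""
    tokens = TOKEN.findall(text)
    # The base case of the rule: expr derives a single NUMBER.
    value = int(tokens[0])
    # Each "'+' NUMBER" pair extends the left-recursive expr rule once.
    pos = 1
    while pos < len(tokens):
        if tokens[pos] != "+":
            raise SyntaxError("expected '+', got %r" % tokens[pos])
        value += int(tokens[pos + 1])
        pos += 2
    return value

print(parse_sum("1 + 2 + 39"))
```

A generated parser does essentially this, but driven by tables built from the grammar rather than hand-coded loops, which is what makes large grammars manageable.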

Note

There is also a Python implementation of Lex and Yacc, PLY, available at http://www.dabeaz.com/ply/. Its self-stated goal is simply to mimic the functionality of the standard Lex and Yacc utilities.

Apache Tika

Tika is another Apache Java project. The Tika utilities extract structured data from various document types; when processing non-plain-text file types, Lucene relies upon the Tika libraries to extract and normalize data for indexing. Tika's home on the web is http://tika.apache.org/.

This is quite a powerful package. In addition to text extraction, Tika supports EXIF data found in images, metadata from MP3 files, and extraction of information from FLV Flash video. While not callable directly from CPython, Tika supplies command-line utilities that may be driven programmatically via the subprocess module.
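A minimal sketch of that approach follows. It assumes the Tika command-line application jar (tika-app.jar) has been downloaded and that a Java runtime is on the path; the helper names are our own, and we only build the command here rather than actually launching Java:

```python
import subprocess

# Path to the downloaded Tika command-line jar; adjust as needed.
TIKA_JAR = "tika-app.jar"

def tika_command(path):
    """Build the command line that asks Tika for a document's plain text."""
    # Tika's CLI writes the extracted text to standard output when
    # invoked with the --text option.
    return ["java", "-jar", TIKA_JAR, "--text", path]

def extract_text(path):
    """Run the Tika CLI on a document and return the extracted text."""
    proc = subprocess.Popen(tika_command(path), stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    return output
```

Because the heavy lifting happens in a separate Java process, this pattern keeps your Python code portable across CPython versions at the cost of a process launch per document.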