Python 2.6 Text Processing: Beginner's Guide

By: Jeff McNeil
Overview of this book

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats, and installing the supporting libraries and components so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML and HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.

Getting started with Python 3


As we've mentioned, Python 3 is the next major release of the Python programming language. As of the time of writing, the most recent version of Python 3 was 3.1.2. Python 3 aims to clean up a lot of the language cruft that remained through years of backwards-compatible development. That's the good news. The bad news is that a number of the changes made to the language are not backwards compatible. In other words, your code will break. This was the first intentionally backwards-incompatible release.

In this section, we'll highlight some core differences between Python 2 and Python 3. We'll also step through the recommended porting process so you can get a feel for how to move your code forward. For an overview of the Python 3 development and porting process, you should read PEP3000, available at http://www.python.org/dev/peps/pep-3000/.

Note

Python 3 is a rather clean language and the porting process is not terribly difficult. However, many of the common third-party packages have not yet been ported. If your applications rely on libraries, which are not compatible, you may have to hold off on your upgrade. Or, better yet, perhaps you could donate some of your expertise and help with the effort!

Major language changes

There are some big changes to Python proper that you'll need to understand when moving into Python 3. The Python website has an excellent guide to the changes present in version 3.0. The guide is available at http://docs.python.org/release/3.0.1/whatsnew/3.0.html. It doesn't cover the latest version, but it does cover the larger major version switch. We'll survey some of the major syntactical changes here. It may also be beneficial to read PEP3100, which provides a collection of changes made to the language during the upgrade to version 3. It is available at http://www.python.org/dev/peps/pep-3100/.

Print is now a function

In previous versions, print was a statement. No parentheses were required and you couldn't pass print around as a first-class object. That changes with Python 3. The following snippet is valid Python 2 code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('outfile', 'w')
>>> print >>f, "The Output"
>>>

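Running the preceding code under a Python 3 interpreter, however, raises an exception (a TypeError, because print >>f is now evaluated as an ordinary right-shift expression on the print function). In a script, that exception would bubble up the call stack and terminate your application. The Python 3 way is as follows.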

Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('outfile', 'w')
>>> print('The Output', file=f)
>>>

It will take some time to get used to treating print as a function rather than a statement, but it's worth it. Because it is now a function, print can be passed around as a first-class object, on par with any user-defined wrappers that would have been used previously.

This change is documented in PEP3105, which is available at http://www.python.org/dev/peps/pep-3105/.
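As a rough illustration of what "first-class" buys you, the following Python 3 sketch hands print to another function and uses its keyword arguments. The helper name emit_all and the in-memory buffer (standing in for the open file) are made up for this example.

```python
import io


def emit_all(lines, writer=print, **kwargs):
    """Send each line to writer; writer defaults to the built-in print."""
    for line in lines:
        writer(line, **kwargs)


# An in-memory text stream stands in for the 'outfile' opened earlier.
buffer = io.StringIO()

# print is passed as an argument just like any user-defined function.
emit_all(["The Output", "More Output"], writer=print, file=buffer)

# sep and end are ordinary keyword arguments of the print function.
print("a", "b", "c", sep="-", end="!\n", file=buffer)

captured = buffer.getvalue()
```

Any callable that accepts the same arguments, such as a logging wrapper, could be passed as writer instead of print.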

Catching exceptions

Python's syntax for catching exceptions has been changed as well. Previously, programmers would write code similar to the following.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> try:
...     1/0
... except ZeroDivisionError, e:
...     print e
... 
integer division or modulo by zero

This is perfectly valid syntax; however, it often leads to bugs that are not always easy to discover during development. Consider the following code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> try:
...     1/0
... except ZeroDivisionError, OSError:
...     print "Got an Error"
... 
Got an Error
>>> 

What's wrong with this? The developer intends to catch either ZeroDivisionError or OSError. However, that's not how Python 2 treats this clause. It catches only ZeroDivisionError and assigns the caught exception object to the name OSError! To eliminate that problem (and the awkward syntax), Python 3 requires the as keyword to bind the exception to a name, and a parenthesized tuple to catch more than one exception type.

Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> try:
...     1/0
... except (ZeroDivisionError, OSError) as e:
...     print(e)
... 
int division or modulo by zero
>>>

Attempting to use the comma syntax from the Python 2 examples results in a SyntaxError under Python 3. This ensures there is no ambiguity following the except keyword.

Exception changes were proposed in PEP3110. These updates actually made it into the Python 2 series as well. More information is available at http://www.python.org/dev/peps/pep-3110/.

Note

It is acceptable to use the as keyword for exception purposes in Python 2.6 as well. If your code does not need to run on earlier interpreters, you can go right ahead and use the newer syntax now.
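Here is a minimal Python 3 sketch of the tuple-plus-as form wrapped in a function. The function name safe_divide is hypothetical, and TypeError is swapped in for OSError purely so that both branches can be triggered from plain arithmetic.

```python
def safe_divide(numerator, denominator):
    """Return the quotient, or None when the division cannot be done."""
    try:
        return numerator / denominator
    except (ZeroDivisionError, TypeError) as error:
        # Either exception type lands here; error is bound to the instance.
        print("Got an error:", type(error).__name__)
        return None


result_ok = safe_divide(6, 3)      # normal path
result_zero = safe_divide(1, 0)    # triggers ZeroDivisionError
result_type = safe_divide("1", 2)  # triggers TypeError
```

The parentheses around the exception types are what make this a tuple of classes to catch, while the name after as is the only place the caught instance is bound.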

Using metaclasses

Metaclasses are a bit of an advanced topic; however, their syntax is worthy of mention. A metaclass is essentially a class that is responsible for building a class. Try not to think about that too hard just yet!

Previous versions of Python applied a series of rules to determine what a class's metaclass would be. Programmers could specify one explicitly by inserting an attribute named __metaclass__ into the class definition. It was also possible to set __metaclass__ at the module level (typically to type), which would cause all classes defined in that module without explicit base classes to default to new-style.

In Python 3, all classes are new-style. If you need to specify a metaclass, you can now do so via a keyword-style argument within the class definition.

Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> class UselessMetaclassStatement(metaclass=type):
...     pass
... 
>>> 

The above example, while pointless, illustrates the new syntax.
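For a slightly less pointless use of the same syntax, consider this sketch of a metaclass that records every class it builds. All of the names here (PluginRegistry, Parser, CsvParser, JsonParser) are invented for illustration and are not part of any library.

```python
class PluginRegistry(type):
    """Hypothetical metaclass that records the classes it builds."""

    registered = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # The root class is declared without bases; record only subclasses.
        if bases:
            PluginRegistry.registered[name] = cls


class Parser(metaclass=PluginRegistry):
    """Root class; subclasses inherit the metaclass automatically."""


class CsvParser(Parser):
    pass


class JsonParser(Parser):
    pass


# Each subclass was registered as a side effect of its class statement.
registered_names = sorted(PluginRegistry.registered)
```

Only the root class needs the metaclass keyword; every subclass is built by the same metaclass without repeating it.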

New reserved words

A few additional names are now treated as reserved words and cannot be reassigned. In Python 3, True, False, None, as, and with are all reserved. The latter two were already reserved as of 2.6, which also partially enforced the restriction on reassigning None.
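One way to check this for yourself under Python 3 is to ask the standard keyword module. The following short sketch also tries to compile an assignment to None; the variable names are arbitrary.

```python
import keyword

# Ask the interpreter which of these names are reserved words.
names = ["True", "False", "None", "as", "with", "print"]
reserved = {name: keyword.iskeyword(name) for name in names}

# Assigning to a reserved word is rejected when the code is compiled.
try:
    compile("None = 1", "<example>", "exec")
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False
```

Note that print is not in the reserved set: as a plain built-in function, it can be rebound like any other name.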

Major library changes

As should be expected, a number of modules in the standard library were updated, added, or removed. Many of them were renamed to follow PEP8-compliant naming conventions. For example, Queue becomes queue and ConfigParser becomes configparser. The list of changes is extensive. For a detailed look, see http://www.python.org/dev/peps/pep-3108/, which describes all of the updates.
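If a module has to run under both interpreters during a transition, one common pattern is to try the new module name first and fall back to the old one. This is only a sketch; the section and option names are made up.

```python
try:
    # Python 3 module names.
    import configparser
    import queue
except ImportError:
    # Python 2 module names, aliased to the new spellings.
    import ConfigParser as configparser
    import Queue as queue

# From here on, the code uses the Python 3 names only.
parser = configparser.RawConfigParser()
parser.add_section("paths")
parser.set("paths", "output", "outfile")

work = queue.Queue()
work.put(parser.get("paths", "output"))
first_item = work.get()
```

Keeping the aliasing in one place means the rest of the module does not need to know which interpreter it is running under.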

Changes to list comprehensions

Python's list comprehensions are a powerful feature. They've been changed in Python 3 and generally cleaned up a bit. There are two major changes that you should remember. First, loop control variables are no longer leaked into the enclosing scope.

For example, the following is valid in Python 2:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> [i for i in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> i
9
>>> 

The above example would result in a NameError under Python 3 when attempting to access i outside of the list comprehension proper (assuming i was not already defined in the enclosing scope).

Also, the [i for i in 1,2,3] syntax is no longer valid. Literals like this must now be enclosed in parentheses (making them valid tuples). So, the form [i for i in (1,2,3)] should now be used.
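Both changes can be checked with a few lines of Python 3. In this sketch, the function name comprehension_leaks and the variable names are arbitrary.

```python
def comprehension_leaks():
    """Return True if the comprehension variable is visible afterwards."""
    values = [item_index for item_index in range(10)]
    try:
        item_index  # only defined here if the comprehension leaked it
    except NameError:
        return False
    return True


leaked = comprehension_leaks()

# The parenthesized tuple form is the supported spelling.
doubled = [i * 2 for i in (1, 2, 3)]

# The bare 1, 2, 3 form no longer compiles.
try:
    compile("[i for i in 1, 2, 3]", "<example>", "eval")
    old_syntax_ok = True
except SyntaxError:
    old_syntax_ok = False
```

Under Python 2, the same function would return True, which is exactly the behavior the transcript above demonstrates.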

Migrating to Python 3

Now, we'll look at the migration process from Python 2 up to Python 3. It's really not as difficult as it sounds! The Python 3 distribution ships with a utility named 2to3, which handles the changes we've outlined above, as well as many others. The recommended update process is as follows:

  1. Ensure you have up-to-date unit tests so that you can validate functionality after you've made all of the required updates.

  2. Under Python 2.6 (or 2.7), run your code with the -3 switch. This enables Python 3-related warnings. Take the time to go through and fix them manually.

  3. Run the 2to3 utility on the updated code once it runs cleanly with the -3 switch.

  4. Manually fix any remaining problems until your unit tests are passing again, as they should be after any major update.
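To make step 1 concrete, here is a minimal sketch of the kind of unit test that can be run before and after the port. The function under test, count_words, is a hypothetical stand-in for your own code. The test module sticks to syntax common to both language versions, so it should run unchanged under either interpreter.

```python
import unittest


def count_words(text):
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())


class CountWordsTest(unittest.TestCase):
    def test_simple_sentence(self):
        self.assertEqual(count_words("The Output"), 2)

    def test_empty_string(self):
        self.assertEqual(count_words(""), 0)


# Build and run the suite explicitly so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountWordsTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

If outcome.wasSuccessful() returns the same answer before and after the port, you have some evidence that the behavior was preserved.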

For more detailed information on the 2to3 utility, see http://docs.python.org/release/3.0.1/library/2to3.html. In the following section, we'll run through the process with some of our example code.

Note

Unit tests are very important in situations like this. Having good unit test coverage ensures that you won't be caught off guard after a major language update. We'll skip that step here in our example, but tests should always be in place in a production setting.