Python 2.6 Text Processing: Beginners Guide

By: Jeff McNeil

Overview of this book

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure: these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better. It is a practical guide on how to do what you want with textual data in Python.

Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another.

The book gets you going with a quick look at some data formats, and installing the supporting libraries and components so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML and HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.

Python resources


First and foremost, the Python standard documentation is a wonderful tool and stands to help you with just about any project. Python is known for its batteries-included approach. In other words, many common utilities that might require a third-party extension in a different language already reside in the Python standard library. The main Python documentation page can be found at http://docs.python.org.

If you're new to Python, the Python.org tutorial is highly recommended. It is kept in lockstep with major releases of the language, so the introduction it provides is always up to date.

Previous versions of both the standard library reference and the official tutorial are also available, so if the version of Python you happen to have on your system is older than the latest available release, you can access the corresponding documentation.

Unofficial documentation

Mark Pilgrim's Dive into Python is available online, free of charge, and can be purchased in paperback format. This serves as a comprehensive guide to the language. The text is available online at http://www.diveintopython.org. If you're interested in Python 3 specifically, Packt Publishing's Python 3 Object Oriented Programming is a great book to add to your collection.

If you're not fully familiar with the standard library yet, another good resource is Doug Hellmann's Python Module of the Week series, in which he dives into each standard library module in detail. Doug's series can be found at http://www.doughellmann.com/projects/PyMOTW/. Familiarizing yourself with the standard library can help you avoid reinventing the wheel in your own projects.

Python enhancement proposals

We've referenced a few PEP documents throughout this book, but we haven't gone into much detail as to what they are. Whenever a core change is made to the language or its supporting cast (libraries and so on), the change usually goes through a proposal process. The advocate for change authors a Python Enhancement Proposal, which is then presented to the appropriate audience for inclusion or dismissal. The PEP index, identified as PEP 0, can be found at http://www.python.org/dev/peps/. Some of the more string- and text-related proposals are as follows:

This, of course, is not an exhaustive list. PEPs can be a helpful resource, but they're usually written with language developers in mind.

Note

Throughout this book, we've tried to adhere to Python's PEP 8 for style guidelines. PEP 8 was written as the style guide for the code in Python's own standard library; however, it has become the community standard. PEP 8 can be found at http://www.python.org/dev/peps/pep-8/.

Self-documenting

In addition to standard developer documentation, Python is quite self-documenting. Good programming practice dictates that developers specify documentation strings at the module, class, method, and function level. Because the documentation lives alongside the code it describes, it is much easier to keep API documentation up to date.
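
As a minimal sketch of that practice, the following toy module places a documentation string at the module, class, and method level and then reads two of them back. The names WordCounter and count are invented for illustration and do not come from the examples elsewhere in this book.

```python
"""Toy module used only to illustrate documentation strings."""
import inspect


class WordCounter(object):
    """Count whitespace-separated words in a piece of text."""

    def count(self, text):
        """Return the number of whitespace-separated words in text."""
        return len(text.split())


# Every documented object exposes its doc string as __doc__.
class_doc = WordCounter.__doc__
# inspect.getdoc returns the same text with indentation cleaned up.
method_doc = inspect.getdoc(WordCounter.count)

print(class_doc)
print(method_doc)
```

Because the doc strings are ordinary attributes of the objects they describe, any tool that can import the module can read them.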

Note

For an overview of Python's documentation string standards, see PEP (Python Enhancement Proposal) 257, which is available at http://www.python.org/dev/peps/pep-0257/. Note that these are the guidelines Python itself uses. You're free to invent your own organizational guidelines.

Doc strings, of course, translate directly into usable Python documentation. For example, the help function provides a mechanism for you to display content generated from doc strings from within the interactive interpreter (the REPL).

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import os
>>> help(os)

The preceding example generates a help page that resembles a standard UNIX manual page. It also provides some extended information on the attributes and inheritance hierarchy that is available via introspection routines.
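
The same machinery works on your own objects. The sketch below assumes a hypothetical shout function and a Python version whose pydoc module provides render_doc. It builds the text that help would otherwise page through, then uses pydoc.plain to strip the overstrike characters that produce bold terminal output. Treat it as an illustration only; the exact layout of the rendered page varies between Python versions.

```python
import pydoc


def shout(text):
    """Return text converted to upper case with a trailing '!'."""
    return text.upper() + "!"


# render_doc returns the help page as a string instead of paging it.
page = pydoc.plain(pydoc.render_doc(shout))
print(page)
```

Capturing the page as a string is handy when you want to log or search help text rather than read it interactively.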

Additionally, most Python installs include the pydoc command, which provides a command-line method to access the same help data.

Neither online help method is limited to library documentation. It is also possible to use either one to display information regarding Python's keywords and topics. For example, let's look at the following command line:

(text_processing)$ pydoc if

That command displays the help page for the if statement. The output may differ depending on the location of your installed Python and the operating system you're using.

You may also notice, when you exit the previous screen, that pydoc prints recommended help topics to the screen. In this case, it will print Related help topics: TRUTHVALUE. This provides a nice way to introduce you to other topics that are related to the exact search keyword.

Note

The pydoc script itself also allows you to run a local web server, search by keyword, generate flat HTML output, or search via a graphical interface. Simply running the command without any arguments will display usage information.

Using other documentation tools

In addition to the built-in help and pydoc systems, there are third-party utilities that can be used to generate more in-depth API documentation:

  • The Sphinx documentation system. This is a more advanced system that allows you to provide raw documentation in addition to your documentation strings. Sphinx can then be configured, through its settings, to extract documentation from your source code. More information is available at http://sphinx.pocoo.org/.

  • The Doxygen system, available at http://www.stack.nl/~dimitri/doxygen/, works with a variety of languages and supports a variety of output formats. It can also be used to extract source-based documentation.

  • The epydoc package, which is available at http://epydoc.sourceforge.net/. This package uses a lightweight markup to generate detailed package documentation. This is similar to the Javadoc system.

These are wonderful utilities, though their intent is more to document your own code rather than view standard Python documentation.
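
To give a sense of what such tools consume, here is a sketch of a doc string that uses reStructuredText field lists, the markup Sphinx works with. The truncate function and its parameters are hypothetical and exist only for this illustration.

```python
def truncate(text, limit=20):
    """Shorten text to at most limit characters.

    :param text: the string to shorten
    :param limit: maximum length of the result, a non-negative integer
    :returns: text unchanged if it already fits, otherwise its first
        limit characters
    """
    if len(text) <= limit:
        return text
    # Slicing keeps only the leading characters.
    return text[:limit]


print(truncate("Python text processing made simple"))
```

The doc string remains ordinary text, so help and pydoc still display it unchanged; the field markup only matters to tools that choose to interpret it.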

Community resources

Python comes with a collection of useful modules and libraries and a world-class community. There are numerous ways you can interact, both in requesting help and providing guidance. Let's take a look at the options.

Following groups and mailing lists

There are mailing lists and groups out there for general Python usage, beginners' questions, and special interest groups. Available lists are detailed at http://www.python.org/community/lists/. Let's outline a few of the more popular ones here.

  • The comp.lang.python group is the main Python discussion group. This is a high-volume group where experienced Python developers discuss problems and designs and answer questions throughout the day. This is a wonderful resource. It is possible to access this group via Google Groups so that you don't have to manage the e-mail volume.

  • The Python-tutor mailing list is designed to be a place for beginners to ask questions that may be less than welcome on the comp.lang.python group. It's also a wonderful place to lend your expertise and help others learn the technology.

  • Python-Dev is where development of the Python language takes place. This is not for questions related to development in Python; rather, this is for development of Python.

  • Python-Help is a rather interesting list. You may send Python-related questions to this mailbox and it will be monitored by a set of volunteers. They may, depending on the experience level, address your question in private.

  • The Python Papers Anthology, available at http://www.pythonpapers.org, is a thorough collection of industry and academic documents available on the web. Their goal is to disseminate information regarding Python technologies and their application.

In addition to the standard mailing lists, there is a collection of Special Interest Groups that narrows down into yet more specific territory. SIGs are formed to address and maintain a specific area of Python. Membership is informal. For a list of all of the active SIGs, see the main page at http://www.python.org/community/sigs/.

Finding a users' group

Python users' groups are local organizations managed by local individuals who share a common interest in the Python programming language. Generally, users' groups meet on a recurring schedule and encourage discussion and information-sharing between members. This is a wonderful way to get involved with the Python community, make friends, and learn about a specific area of the language you may not be familiar with. There are two resources for finding Python users' groups.

First, http://wiki.python.org/moin/LocalUserGroups provides a list of groups broken down by geographic region. Second, http://www.meetup.com/ is also a great resource. The http://python.meetup.com/ site provides a listing of scheduled Python-related meetups in your local area.

These are also great places to try out your speaking and presentation skills with a friendly, tolerant, and eager audience.

Attending a local Python conference

Each year, various large-scale Python conferences are held all over the planet. These are highly technical events. While vendors are present, the focus is on Python technology discussion. For information about the various Python conferences, see http://www.pycon.org. These events are packed full of tutorials, sessions, and coding sprints. Volunteers within the Python community put these conferences together.