
Time for action – full text indexing


In this example, we'll create a full text index of our large data set using Nucular. We'll use the same comp.lang.python messages we used previously, which are available via the Packt Publishing FTP site. We'll index only one month at a time in order to keep the examples manageable; in aggregate, though, that gives us more than 85,000 files to work with, totaling roughly 315 MB of raw text data.

In creating a full text index, we won't separate each message out into its component parts. All of the text for each message will become a single attribute within each Nucular entry.
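To make that concrete, here is a small, purely illustrative sketch of the shape of one entry; the attribute name message and the sample text are placeholders rather than anything mandated by Nucular:

    # The entire raw message (headers and body together) goes in as the value
    # of a single attribute, instead of being split into subject/sender/body.
    raw = "From: someone@example.org\nSubject: hello\n\nMessage body here."
    entry = {'message': raw}   # one attribute per Nucular entry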

  1. Create a new file and name it clp_index.py. We'll use this to generate our index. Enter the following code:

    import os
    from optparse import OptionParser
    from nucular import Nucular
    
    def index_contents(session, where, persist_every=100):
        """Index a directory at a time."""
        # Walk every message file in the given directory and add it to the
        # Nucular archive through the supplied session.
        for c, i in enumerate(os.listdir(where)):
            full_path = os.path.join(where, i)
            print 'indexing %s' % full_path
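            # NOTE: the remainder of this listing is a hedged sketch, not the
            # book's original code. It assumes Nucular's indexDictionary() and
            # store() methods and keeps the whole message under a single
            # 'message' attribute, a name chosen here purely for illustration.
            raw_text = open(full_path).read()
            session.indexDictionary(i, {'message': raw_text})
            # Persist periodically so memory use stays bounded while we work
            # through tens of thousands of files.
            if c and not c % persist_every:
                session.store(lazy=True)
        # Store whatever is left after the final partial batch.
        session.store(lazy=True)
    
    if __name__ == '__main__':
        # Minimal command-line wrapper; the option names below are assumed
        # for illustration. The archive directory is expected to already
        # exist (created beforehand with Nucular's create()).
        parser = OptionParser()
        parser.add_option('-a', '--archive', help='Nucular archive directory')
        parser.add_option('-d', '--directory', help='directory of messages to index')
        options, args = parser.parse_args()
        session = Nucular.Nucular(options.archive)
        index_contents(session, options.directory)
        # Fold the newly stored entries into the searchable index
        # (aggregateRecent is assumed from Nucular's API).
        session.aggregateRecent(fast=True, verbose=True)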