Python 2.6 Text Processing: Beginner's Guide

By: Jeff McNeil

Overview of this book

For programmers, working with text is not about reading their newspaper on a break; it's about taking textual data in one form and doing something to it. Extract, decrypt, parse, restructure – these are just some of the text tasks that can occupy much of a programmer's life. If this is your life, this book will make it better – a practical guide on how to do what you want with textual data in Python. Python 2.6 Text Processing Beginner's Guide is the easiest way to learn how to manipulate text with Python. Packed with examples, it will teach you text processing techniques and give you the skills to work with the most popular Python libraries for transforming text from one form to another. The book gets you going with a quick look at some data formats, and installing the supporting libraries and components so that you're ready to get started. You move on to extracting text from a collection of sources and handling it using Python's built-in string functions and regular expressions. You look into processing structured text documents such as XML and HTML, JSON, and CSV. Then you progress to generating documents and creating templates. Finally you look at ways to enhance text output via a collection of third-party packages such as Nucular, PyParsing, NLTK, and Mako.

Python 2.6 Text Processing Beginner's Guide

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Getting Started

Categorizing types of text data

Ensuring you have Python installed

Implementing a simple cipher

Time for action – implementing a ROT13 encoder

Time for action – processing as a filter

Time for action – skipping over markup tags

Supporting third-party modules

Time for action – installing SetupTools

Running a virtual environment

Time for action – configuring a virtual environment

Where to get help?

Summary

Working with the IO System

Parsing web server logs

Time for action – generating transfer statistics

Using objects interchangeably

Time for action – introducing a new log format

Accessing files directly

Time for action – accessing files directly

Time for action – handling compressed files

Accessing multiple files

Time for action – spell-checking HTML content

Accessing remote files

Time for action – spell-checking live HTML pages

Time for action – handling urllib2 errors

Handling string IO instances

Understanding IO in Python 3

Summary

Python String Services

Understanding the basics of string object

Time for action – employee management

String formatting

Time for action – customizing log processor output

Time for action – adding status code data

Creating templates

Time for action – displaying warnings on malformed lines

Calling string object methods

Time for action – simple manipulation with string methods

Summary

Text Processing Using the Standard Library

Reading CSV data

Time for action – processing Excel formats

Time for action – CSV and formulas

Time for action – processing custom CSV formats

Writing CSV data

Time for action – creating a spreadsheet of UNIX users

Modifying application configuration files

Time for action – adding basic configuration read support

Time for action – relying on configuration value interpolation

Time for action – configuration defaults

Writing configuration data

Time for action – generating a configuration file

Reconfiguring our source

Time for action – creating an egg-based package

Working with JSON

Time for action – writing JSON data

Summary

Regular Expressions

Simple string matching

Time for action – testing an HTTP URL

Advanced pattern matching

Time for action – regular expression grouping

Implementing Python-specific elements

Time for action – reading DNS records

Summary

Structured Markup

XML data

SAX processing

Time for action – event-driven processing

Time for action – driving incremental processing

Time for action – creating a dungeon adventure game

The Document Object Model

Time for action – updating our game to use DOM processing

XPath

Time for action – using XPath in our adventure

Reading HTML

Time for action – displaying links in an HTML page

Summary

Creating Templates

Time for action – installing Mako

Basic Mako usage

Time for action – loading a simple Mako template

Time for action – reformatting the date with Python code

Time for action – defining Mako def tags

Time for action – converting mail message to use namespaces

Inheriting from base templates

Time for action – updating base template

Time for action – adding another inheritance layer

Customizing

Time for action – creating custom Mako tags

Overviewing alternative approaches

Summary

Understanding Encodings and i18n

Understanding basic character encodings

Unicode

Encodings in Python

Time for action – manually decoding

Time for action – copying Unicode data

Time for action – fixing our copy application

The codecs module

Time for action – changing encodings

Adopting good practices

Internationalization and Localization

Time for action – preparing for multiple languages

Time for action – providing translations

Summary

Advanced Output Formats

Dealing with PDF files using PLATYPUS

Time for action – installing ReportLab

Time for action – writing PDF with basic layout and style

Writing native Excel data

Time for action – installing xlwt

Time for action – generating XLS data

Working with OpenDocument files

Time for action – installing ODFPy

Time for action – generating ODT data

Summary

Advanced Parsing and Grammars

Defining a language syntax

PyParsing

Time for action – installing PyParsing

Time for action – implementing a calculator

Time for action – handling type translations

Time for action – suppressing portions of a match

Processing data using the Natural Language Toolkit

Time for action – installing NLTK

Summary

Searching and Indexing

Understanding search complexity

Time for action – implementing a linear search

Text indexing

Time for action – installing Nucular

Time for action – full text indexing

Time for action – measuring index benefit

Time for action – field-qualified indexes

Time for action – performing advanced Nucular queries

Indexing and searching other data

Time for action – indexing Open Office documents

Other index systems

Summary

Looking for Additional Resources

Python resources

Honorable mention

Getting started with Python 3

Time for action – using 2to3 to move to Python 3

Summary

Pop Quiz Answers

Chapter 1: Getting Started

Chapter 2: Working with the IO System

Chapter 3: Python String Services

Chapter 4: Text Processing Using the Standard Library

Chapter 5: Regular Expressions

Chapter 6: Structured Markup

Chapter 7: Creating Templates

Chapter 8: Understanding Encodings and i18n

Chapter 9: Advanced Output Formats

Chapter 11: Searching and Indexing

Index

Preface

The Python Text Processing Beginner's Guide is intended to provide a gentle, hands-on introduction to processing, understanding, and generating textual data using the Python programming language. Care is taken to ensure the content is example-driven, while still providing enough background information to allow for a solid understanding of the topics covered.

Throughout the book, we use real-world examples such as logfile processing and PDF creation to help you further understand different aspects of text handling. By the time you've finished, you'll have a solid working knowledge of both structured and unstructured text data management. We'll also look at practical indexing and character encodings.

A good deal of supporting information is included. We'll touch on packaging, Python IO, third-party utilities, and some details on working with the Python 3 series releases. We'll even spend a bit of time porting a small example application to the latest version.

Finally, we do our best to provide a number of high-quality external references. While this book covers a broad range of topics, we also want to help you dig deeper when necessary.

What this book covers

Chapter 1, Getting Started: This chapter provides an introduction to character and string data types and how strings are represented using underlying integers. We'll implement a simple encoding script to illustrate how text can be manipulated at the character level. We also set up our systems to allow safe third-party library installation.

Chapter 2, Working with the IO System: Here, you'll learn how to access your data using Python's IO capabilities, reading files both locally and remotely. Finally, we cover how Python's IO layers change in Python 3.

Chapter 3, Python String Services: Covers Python's core string functionality. We look at the methods of string objects, the core template classes, and Python's various string formatting methods. We introduce the differences between Unicode and string objects here.

Chapter 4, Text Processing Using the Standard Library: The standard Python distribution includes a powerful set of built-in libraries designed to manage textual content. We look at configuration file reading and manipulation, CSV files, and JSON data. We take a bit of a detour at the end of this chapter to learn how to create your own redistributable Python egg files.

Chapter 5, Regular Expressions: Looks at Python's regular expression implementation and teaches you how to write and apply regular expressions. We look at standardized concepts as well as Python's extensions. We'll break down a few expressions graphically so that the component parts are easy to piece together. You'll also learn how to safely use regular expressions with international alphabets.

Chapter 6, Structured Markup: Introduces you to XML and HTML processing. We create an adventure game using both SAX and DOM approaches. We also look briefly at lxml and ElementTree. HTML parsing is also covered.

Chapter 7, Creating Templates: Using the Mako template language, we'll generate e-mail and HTML text templates much like the ones that you'll encounter within common web frameworks. We visit template creation, inheritance, filters, and custom tag creation.

Chapter 8, Understanding Encodings and i18n: We provide a look into character encoding schemes and how they work. For reference, we'll examine ASCII as well as KOI8-R. We also look into Unicode and its various encoding mechanisms. Finally, we finish up with a quick look at application internationalization.

Chapter 9, Advanced Output Formats: Provides information on how to generate PDF, Excel, and OpenDocument data. We'll build these document types from scratch using direct Python API calls relying on third-party libraries.

Chapter 10, Advanced Parsing and Grammars: A look at more advanced text manipulation techniques such as those used by programming language designers. We'll use the PyParsing library to handle some configuration file management and look into the Python Natural Language Toolkit.

Chapter 11, Searching and Indexing: A practical look at full text searching and the benefit an index can provide. We'll use the Nucular system to index a collection of small text files and make them quickly searchable.

Appendix A, Looking for Additional Resources: Introduces you to places of interest on the Internet and some community resources. In this appendix, you will learn to create your own documentation and to use Java Lucene based engines. You will also learn about the differences between Python 2 and Python 3 and how to port code to Python 3.
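
As a taste of the character-level manipulation described for Chapter 1, here is a minimal sketch of a ROT13 encoder. It is not the book's own implementation: the function names rot13_char and rot13 are hypothetical, and the code is written so that it should run on Python 3 as well as Python 2.6. It relies on the built-in ord() and chr() functions to move between a character and its underlying integer.

```python
import string


def rot13_char(ch):
    """Rotate one ASCII letter by 13 places; return anything else unchanged."""
    if ch in string.ascii_lowercase:
        base = ord('a')
    elif ch in string.ascii_uppercase:
        base = ord('A')
    else:
        # Digits, punctuation, and whitespace pass through untouched.
        return ch
    # Shift within the 26-letter alphabet, wrapping around at the end.
    return chr((ord(ch) - base + 13) % 26 + base)


def rot13(text):
    """Apply the rotation to every character of text."""
    return ''.join(rot13_char(ch) for ch in text)
```

Because the alphabet has 26 letters, applying rot13 twice returns the original text, so the same function both encodes and decodes.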

What you need for this book

This book assumes you have an elementary knowledge of the Python programming language, so we don't provide a tutorial introduction. From a software angle, you'll simply need a version of Python (2.6 or later) installed. Each time we require a third-party library, we'll detail the installation in the text.

Who this book is for

If you are a novice Python developer who is interested in processing text then this book is for you. You need no experience with text processing, though basic knowledge of Python would help you to better understand some of the topics covered by this book. As the content of this book develops gradually, you will be able to pick up Python while reading.

Conventions

In this book, you will find several headings appearing frequently.

To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading

Action 1
Action 2
Action 3

Instructions often need some extra explanation so that they make sense, so they are followed with:

What just happened?

This heading explains the working of tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop Quiz – heading

These are short multiple choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "First of all, we imported the re module"

A block of code is set as follows:

parser = OptionParser()
parser.add_option('-f', '--file', help="CSV Data File")
opts, args = parser.parse_args()
if not opts.file:

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

def init_game(self):
    """
    Process World XML.
    """
    self.location = parse(open(self.world)).documentElement

Any command-line input or output is written as follows:

(text_processing)$ python render_mail.py thank_you-e.txt

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Any X found in the source data would simply become an A in the output data."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or e-mail <[email protected]>.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Tip

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
