Modern Python Cookbook

Overview of this book

Python is the preferred choice of developers, engineers, data scientists, and hobbyists everywhere. It is a great scripting language that can power your applications and provide great speed, safety, and scalability. By exposing Python as a series of simple recipes, you can gain insight into specific language features in a particular context. Having a tangible context helps make the language or standard library feature easier to understand. This book comes with over 100 recipes on the latest version of Python. The recipes will benefit everyone from beginners to experts. The book is broken down into 13 chapters that build from simple language concepts to more complex applications of the language. The recipes touch upon all the necessary Python concepts related to data structures, OOP, functional programming, and statistical programming. You will get acquainted with the nuances of Python syntax and how to effectively use the advantages that it offers. You will end the book equipped with knowledge of testing, web services, and configuration and application integration tips and tricks. The recipes take a problem-solution approach to resolve issues commonly faced by Python programmers across the globe. You will be armed with the knowledge to create applications with flexible logging, powerful configuration and command-line options, automated unit tests, and good documentation.

String parsing with regular expressions


How do we decompose a complex string? What if we have complex, tricky punctuation? Or—worse yet—what if we don't have punctuation, but have to rely on patterns of digits to locate meaningful information?

Getting ready

The easiest way to decompose a complex string is by generalizing the string into a pattern and then writing a regular expression that describes that pattern.

There are limits to the patterns that regular expressions can describe. When we're confronted with deeply nested documents in a language like HTML, XML, or JSON, regular expressions often run into problems, and a dedicated parser is usually the better tool.

The re module contains all of the various classes and functions we need to create and use regular expressions.

Let's say that we want to decompose text from a recipe website. Each line looks like this:

>>> ingredient = "Kumquat: 2 cups"

We want to separate the ingredient from the measurements.

How to do it...

To write and use regular expressions, we often do this:

  1. Generalize the example. In our case, we have something that we can generalize as:
(ingredient words): (amount digits) (unit words)

We've replaced literal text with a two-part summary: what it means and how it's represented. For example, ingredient is represented as words, and amount is represented as digits.

  1. Import the re module:
>>> import re
  1. Rewrite the pattern into Regular Expression (RE) notation:
>>> pattern_text = r'(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)'

We've replaced representation hints such as words with \w+. We've replaced digits with \d+. And we've replaced single spaces with \s+ to allow one or more spaces to be used as punctuation. We've left the colon in place, because in the regular expression notation, a colon matches itself.

For each of the fields of data, we've used ?P<name> to provide a name that identifies the data we want to extract. We didn't do this around the colon or the spaces because we don't want those characters.

REs use a lot of \ characters. To make this work out nicely in Python, we almost always use raw strings. The r' prefix tells Python not to look at the \ characters and not to replace them with special characters that aren't on our keyboards.

  1. Compile the pattern:
>>> pattern = re.compile(pattern_text)
  1. Match the pattern against input text. If the input matches the pattern, we'll get a match object that shows details of the matching:
>>> match = pattern.match(ingredient)
>>> match is None
False
>>> match.groups()
('Kumquat', '2', 'cups')

This, by itself, is pretty cool: we have a tuple of the different fields within the string. We'll return to the use of tuples in a recipe named Using tuples.

  1. Extract the named groups of characters from the match object:
>>> match.group('ingredient')
'Kumquat'
>>> match.group('amount')
'2'
>>> match.group('unit')
'cups'

Each group is identified by the name we used in the (?P<name>...) part of the RE.
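
The steps above can be collected into one small function. This is only a sketch: the name parse_ingredient and the choice to return None for a line that doesn't match are assumptions made for illustration, not part of the recipe.

```python
import re

# Same pattern text as in the recipe, compiled once and reused.
INGREDIENT_PATTERN = re.compile(
    r'(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)'
)


def parse_ingredient(line):
    """Return the named groups as a dict, or None if the line doesn't match.

    parse_ingredient is a hypothetical helper name used only for this sketch.
    """
    match = INGREDIENT_PATTERN.match(line)
    if match is None:
        # match() returns None when the pattern doesn't fit the start of the text.
        return None
    return match.groupdict()


print(parse_ingredient("Kumquat: 2 cups"))
print(parse_ingredient("no punctuation here"))
```

The first call prints {'ingredient': 'Kumquat', 'amount': '2', 'unit': 'cups'} and the second prints None. Every extracted value is still a string, so the amount needs int() before it can be used in arithmetic.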

How it works...

There are a lot of different kinds of string patterns that we can describe with RE.

We've shown a number of character classes:

  • \w matches any alphanumeric character (a to z, A to Z, 0 to 9) or an underscore
  • \d matches any decimal digit
  • \s matches any whitespace character, such as a space, tab, or newline

These classes also have inverses:

  • \W matches any character that's not a letter, a digit, or an underscore
  • \D matches any character that's not a digit
  • \S matches any character that's not whitespace
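
One way to see what each class picks out is to run it, with a + suffix, over the sample line from this recipe. This is a small illustrative sketch; the variable names sample and results are assumptions for the example.

```python
import re

sample = "Kumquat: 2 cups"

# For each class, collect every run of one or more matching characters.
results = {
    cls: re.findall(cls + '+', sample)
    for cls in (r'\w', r'\d', r'\s', r'\W', r'\D', r'\S')
}

for cls, found in results.items():
    print(cls, found)
```

This prints the following:

    \w ['Kumquat', '2', 'cups']
    \d ['2']
    \s [' ', ' ']
    \W [': ', ' ']
    \D ['Kumquat: ', ' cups']
    \S ['Kumquat:', '2', 'cups']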

Many characters match themselves. Some characters, however, have special meaning, and we have to use \ to escape from that special meaning:

  • We saw that + as a suffix means to match one or more repetitions of the preceding pattern. \d+ matches one or more digits. To match an ordinary +, we need to use \+.
  • We also have * as a suffix, which matches zero or more repetitions of the preceding pattern. \w* matches zero or more alphanumeric characters. To match a *, we need to use \*.
  • We have ? as a suffix, which matches zero or one of the preceding expressions. This character is used in other places, and has a slightly different meaning. We saw it in (?P<name>...) where it was inside the () to define special properties for the grouping.
  • The . matches any single character (by default, any character except a newline). To match a . specifically, we need to use \.
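
The difference between a special character and its escaped form is easiest to see side by side. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# + needs at least one digit; * also accepts none at all.
plus_on_empty = re.fullmatch(r'\d+', '')   # None: no digits to match
star_on_empty = re.fullmatch(r'\d*', '')   # a match object for the empty string

# ? makes the preceding "s" optional: zero or one, but not two.
optional_s = [bool(re.fullmatch(r'cups?', word)) for word in ('cup', 'cups', 'cupss')]

text = 'pi is 3.14, not 3x14'
escaped_dot = re.findall(r'\d+\.\d+', text)  # \. matches only a literal "."
bare_dot = re.findall(r'\d+.\d+', text)      # . matches "." and also "x"

escaped_plus = re.findall(r'\d+\+\d+', '2+2 and 22')  # \+ matches a literal "+"

print(plus_on_empty is None, star_on_empty is not None)
print(optional_s)
print(escaped_dot, bare_dot)
print(escaped_plus)
```

The output is True True, then [True, True, False], then ['3.14'] ['3.14', '3x14'], and finally ['2+2'].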

We can create our own unique sets of characters using [] to enclose the elements of the set. We might have something like this:

    (?P<name>\w+)\s*[=:]\s*(?P<value>.*)

This has a \w+ to match one or more alphanumeric characters. This will be collected into a group with the name of name.

It uses \s* to match an optional sequence of spaces.

It matches any character in the set [=:]. One of the two characters in this set must be present.

It uses \s* again to match an optional sequence of spaces.

Finally, it uses .* to match everything else in the string. This is collected into a group named value.

We can use this to parse strings like this:

    size = 12 
    weight: 14

By being flexible with the punctuation, we can make a program easier to use. We'll tolerate any number of spaces, and either an = or a : as a separator.
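
Here is a short sketch of that pattern applied to the two sample lines, plus a third line with no spaces at all. The function name parse_setting and the extra "color=red" input are assumptions added for illustration.

```python
import re

setting_pattern = re.compile(r'(?P<name>\w+)\s*[=:]\s*(?P<value>.*)')


def parse_setting(line):
    """Return (name, value) for a line, or None if it doesn't match.

    parse_setting is a hypothetical helper name used only for this sketch.
    """
    match = setting_pattern.match(line)
    if match is None:
        return None
    return match.group('name'), match.group('value')


for line in ("size = 12 ", "weight: 14", "color=red"):
    print(parse_setting(line))
```

This prints ('size', '12 '), ('weight', '14'), and ('color', 'red'). Because .* matches everything that remains, the trailing space in the first line ends up in the value; calling .strip() on the value is one way to remove it.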

There's more...

A long regular expression can be awkward to read. We have a clever Pythonic trick for presenting an expression in a way that's much easier to read:

>>> ingredient_pattern = re.compile(
... r'(?P<ingredient>\w+):\s+' # name of the ingredient up to the ":"
... r'(?P<amount>\d+)\s+'      # amount, all digits up to a space
... r'(?P<unit>\w+)'           # units, alphanumeric characters
... )

This leverages three syntax rules:

  • A statement isn't finished until the () characters match
  • Adjacent string literals are silently concatenated into a single long string
  • Anything between # and the end of the line is a comment, and is ignored

We've put Python comments after the important clauses in our regular expression. This can help us understand what we did, and perhaps help us diagnose problems later.
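
As a quick check, the sketch below confirms that the commented form compiles to exactly the same pattern text as the one-line version. It also shows a different technique that this recipe does not use: the re module's re.VERBOSE flag, which lets the comments live inside a single pattern string. The variable names one_line, commented, and verbose are assumptions for the example.

```python
import re

one_line = r'(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)'

commented = re.compile(
    r'(?P<ingredient>\w+):\s+'  # name of the ingredient up to the ":"
    r'(?P<amount>\d+)\s+'       # amount, all digits up to a space
    r'(?P<unit>\w+)'            # units, alphanumeric characters
)

# Python joined the adjacent literals before re.compile() ever saw them.
print(commented.pattern == one_line)

# Alternative (not the recipe's approach): with re.VERBOSE, the regex engine
# itself ignores unescaped whitespace and # comments inside the pattern.
verbose = re.compile(
    r'''
    (?P<ingredient>\w+):\s+   # name of the ingredient up to the ":"
    (?P<amount>\d+)\s+        # amount, all digits up to a space
    (?P<unit>\w+)             # units, alphanumeric characters
    ''',
    re.VERBOSE,
)

print(verbose.match("Kumquat: 2 cups").groups())
```

The first print shows True and the second shows ('Kumquat', '2', 'cups'). With re.VERBOSE, a literal space in the pattern has to be written as \s, as a backslash followed by a space, or inside [], because plain whitespace is ignored.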

See also