Mastering Python Regular Expressions

Overview of this book

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. They are considered the Swiss army knife of text processing: powerful search, replacement, extraction, and validation of strings, along with other repetitive and complex tasks, can be reduced to a simple pattern. Mastering Python Regular Expressions teaches you about regular expressions from the basics, independently of the language being used, and then shows you how to use them in Python. You will learn the finer details of what Python supports and how to use it, as well as the differences between Python 2.x and Python 3.x. The book starts with a general review of the theory behind regular expressions, follows with an overview of Python's regular expression module implementation, and then moves on to advanced topics such as grouping, look-around, and performance. You will explore how to leverage regular expressions in Python, some of their advanced aspects, and how to measure and improve their performance. You will gain a better understanding of how alternation and quantifiers work and of the importance of grouping, before finally moving on to performance topics such as the RegexBuddy tool and backtracking. Mastering Python Regular Expressions provides the information essential for a better understanding of regular expressions in Python.
Table of Contents (12 chapters)

History, relevance, and purpose


Regular expressions are pervasive. They can be found everywhere, from the newest office suite or JavaScript framework to UNIX tools dating back to the 1970s. No modern programming language can be called complete until it supports regular expressions.

Although they are prevalent in languages and frameworks, regular expressions are not yet pervasive in the modern coder's toolkit. One reason often given is their steep learning curve. Regular expressions can be difficult to master and very hard to read if they are not written with care.

As a result of this complexity, an old chestnut turns up regularly in Internet forums:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

--Jamie Zawinski, 1997

You'll find it at https://groups.google.com/forum/?hl=en#!msg/alt.religion.emacs/DR057Srw5-c/Co-2L2BKn7UJ.

Throughout this book, we'll learn best practices for writing regular expressions that make them much easier to read.

Even though regular expressions can be found in the latest and greatest programming languages today, and probably will be for many years to come, their history goes back to 1943, when the neurophysiologist Warren McCulloch and the logician Walter Pitts published A logical calculus of the ideas immanent in nervous activity. This paper not only marked the beginning of regular expressions, but also proposed the first mathematical model of a neural network.

The next step was taken in 1956, this time by a mathematician. Stephen Kleene wrote the paper Representation of events in nerve nets and finite automata, where he coined the terms regular sets and regular expressions.

Twelve years later, in 1968, a legendary pioneer of computer science took Kleene's work and extended it, publishing his studies in the paper Regular Expression Search Algorithm. This engineer was Ken Thompson, known for the design and implementation of Unix, the B programming language, and the UTF-8 encoding, among other things.

Ken Thompson's work didn't end with writing a paper. He included support for these regular expressions in his version of QED. To search with a regular expression in QED, the following had to be written:

g/<regular expression>/p

In the preceding line of code, g means global search and p means print. If, instead of writing regular expression, we write the short form re, we get g/re/p, which is the origin of the name of the venerable UNIX command-line tool grep.
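The behavior of g/re/p can be approximated in Python with the standard re module. The following is only a minimal sketch: the function name global_regex_print and the sample lines are made up for illustration, and it is not how QED or grep is actually implemented.

```python
import re


def global_regex_print(pattern, lines):
    """Return every line that contains a match for pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    # "g" (global): visit every line; "p" (print): keep the lines that match.
    return [line for line in lines if compiled.search(line)]


# Made-up sample input, used only for this illustration.
sample_lines = [
    "grep prints matching lines",
    "QED was a text editor",
    "the grape and the grip",
]

for line in global_regex_print(r"gr[ae]p", sample_lines):
    print(line)
# grep prints matching lines
# the grape and the grip
```

The sketch uses search rather than match because a line qualifies when the pattern occurs anywhere in it, not only at the start.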

The next outstanding milestones were the release of the first non-proprietary regex library by Henry Spencer and, later, the creation of the scripting language Perl by Larry Wall. Perl pushed regular expressions into the mainstream.

The Perl implementation went further and added many modifications to the original regular expression syntax, creating the so-called Perl flavor. Many later implementations in other languages and tools are based on the Perl flavor of regular expressions.

The IEEE, through its POSIX standard, has tried to standardize the regular expression syntax and behaviors and to give them better Unicode support. This is called the POSIX flavor of regular expressions.

Today, the standard Python module for regular expressions, re, supports only Perl-style regular expressions. There is an effort to write a new regex module with better POSIX-style support at https://pypi.python.org/pypi/regex. This new module is intended to eventually replace Python's re module implementation. In this book, we will learn how to leverage only the standard re module.
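As a small taste of what Perl-style syntax looks like in the standard re module, the following sketch uses three constructs popularized by Perl: the \d shorthand, a lazy quantifier, and a lookahead. The sample string and variable names are invented for this illustration.

```python
import re

# Made-up sample text, used only for this illustration.
log_line = "user=alice id=42 <b>bold</b> <i>italic</i>"

# \d is a Perl-style shorthand for a digit.
user_id = re.search(r"id=(\d+)", log_line).group(1)

# *? is a lazy quantifier: it stops at the first closing bracket it can.
first_tag = re.search(r"<.*?>", log_line).group(0)

# (?==) is a lookahead: it requires an "=" next without consuming it.
keys = re.findall(r"\w+(?==)", log_line)

print(user_id)    # 42
print(first_tag)  # <b>
print(keys)       # ['user', 'id']
```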

Tip

Regular expressions, regex, regexp, or regexen?

Henry Spencer referred to his famous library interchangeably as "regex" or "regexp". Wikipedia proposes regex or regexp as abbreviations. The famous Jargon File lists them as regexp, regex, and reg-ex.

However, even though there does not seem to be a very strict approach to naming regular expressions, they are rooted in the field of mathematics called formal languages, where being exact is everything. Most modern implementations support features, such as backreferences, that go beyond what regular expressions can express in the formal sense, and therefore, they are not real regular expressions. Larry Wall, creator of the Perl language, used the term regexes or regexen for this reason.
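A backreference is one example of a feature that goes beyond regular expressions in the strict formal sense: matching "some word followed by the same word again" requires remembering what was matched earlier. The following sketch uses Python's re module to show the idea; the sample sentence is made up.

```python
import re

# Made-up sample sentence containing accidentally doubled words.
sentence = "the the cat sat on on the mat"

# (\w+) captures a word; \1 must then match exactly the same text again.
doubled = re.findall(r"\b(\w+) \1\b", sentence)

print(doubled)  # ['the', 'on']
```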

In this book, we will use all the aforementioned terms interchangeably, as if they were perfect synonyms.