Python Automation Cookbook
When dealing with text, it's often necessary to manipulate and process it; that is, to be able to join it, split it into regular chunks, or change it to be uppercase or lowercase. We'll discuss more advanced methods for parsing text and separating it later; however, in lots of cases, it is useful to divide a paragraph into lines, sentences, or even words. Other times, words will require some characters to be removed or a word will need to be replaced with a canonical version to be able to compare it with a predetermined value.
We'll define a basic piece of text and transform it into its main components; then, we'll reconstruct it. As an example, a report needs to be transformed into a new format to be sent via email.
The input format we'll use in this example will be this:
AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
WITH THE OBJECTIVES FOR THE YEAR. THE MAIN DRIVER OF THE SALES HAS BEEN
THE NEW PACKAGE DESIGNED UNDER THE SUPERVISION OF OUR MARKETING DEPARTMENT.
OUR EXPENSES HAS BEEN CONTAINED, INCREASING ONLY BY 0.7%, THOUGH THE BOARD
CONSIDERS IT NEEDS TO BE FURTHER REDUCED. THE EVALUATION IS SATISFACTORY
AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC. THE BOARD EXPECTS
AN INCREASE IN PROFIT OF AT LEAST 2 MILLION DOLLARS.
We need to redact the text to eliminate any references to numbers. It also needs to be properly formatted by adding a new line after each period, wrapped to a maximum width of 80 characters, and transformed into ASCII for compatibility reasons.
The text will be stored in the INPUT_TEXT variable in the interpreter.
>>> INPUT_TEXT = '''
... AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
... HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
...
'''
>>> words = INPUT_TEXT.split()
Replace any digits with an 'X' character:
>>> redacted = [''.join('X' if w.isdigit() else w for w in word) for word in words]
>>> ascii_text = [word.encode('ascii', errors='replace').decode('ascii')
... for word in redacted]
>>> newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
>>> LINE_SIZE = 80
>>> lines = []
>>> line = ''
>>> for word in newlines:
...     if line.endswith('\n') or len(line) + len(word) + 1 > LINE_SIZE:
...         lines.append(line)
...         line = ''
...     line = line + ' ' + word
...
>>> lines.append(line)  # keep the last line, which the loop leaves pending
>>> lines = [line.title() for line in lines]
>>> result = '\n'.join(lines)
>>> print(result)
 After The Close Of The Second Quarter, Our Company, Casta?Acorp Has Achieved A
 Growth In The Revenue Of X.Xx%.

 This Is In Line With The Objectives For The Year.

 The Main Driver Of The Sales Has Been The New Package Designed Under The
 Supervision Of Our Marketing Department.

 Our Expenses Has Been Contained, Increasing Only By X.X%, Though The Board
 Considers It Needs To Be Further Reduced.

 The Evaluation Is Satisfactory And The Forecast For The Next Quarter Is
 Optimistic.

 The Board Expects An Increase In Profit Of At Least X Million Dollars.
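The interactive steps above can also be collected into a single function. The following is a minimal sketch, not the recipe's own code: the name format_report and its line_size parameter are hypothetical, and this variation closes a line right after a period instead of appending a newline to the word, so it produces no leading spaces or blank lines. It also appends whatever line is still pending when the loop ends, so the last sentence is kept.

```python
def format_report(text, line_size=80):
    """Redact digits, force ASCII, and wrap the text (hypothetical helper)."""
    words = text.split()
    # Replace every digit character with 'X'
    redacted = [''.join('X' if c.isdigit() else c for c in word)
                for word in words]
    # Round-trip through ASCII; characters ASCII cannot represent become '?'
    ascii_words = [word.encode('ascii', errors='replace').decode('ascii')
                   for word in redacted]

    lines = []
    line = ''
    for word in ascii_words:
        # Start a new line when adding the word would exceed line_size
        if line and len(line) + len(word) + 1 > line_size:
            lines.append(line)
            line = ''
        line = f'{line} {word}' if line else word
        # A period ends the sentence, so close the current line here
        if word.endswith('.'):
            lines.append(line)
            line = ''
    # Keep any words still pending after the loop
    if line:
        lines.append(line)
    return '\n'.join(ln.title() for ln in lines)


sample = 'OUR COMPANY, CASTAÑACORP GREW 7.47%. THE BOARD EXPECTS 2 MILLION DOLLARS.'
print(format_report(sample))
```

Keeping the transformation in one function makes it easier to reuse on other reports and to adjust the line width in a single place.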
Each step performs a specific transformation of the text:
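The first step splits the text into words with .split(), which separates on whitespace and newlines by default.
The second step iterates over the characters of each word. If a character is a digit, an 'X' is returned instead. This is done with two list comprehensions, one that runs over the list of words and another that runs over each word, replacing a character only if it is a digit: ['X' if w.isdigit() else w for w in word]. Note that the characters are joined together again into a word with ''.join().
The third step encodes each word into ASCII bytes and decodes it back into a Python string. Note the use of the errors parameter to force the replacement of unknown characters such as ñ.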
The difference between strings and bytes is not very intuitive at first, especially if you never have to worry about multiple languages or encoding transformations. In Python 3, there's a strong separation between strings (internal Python representation) and bytes. So most of the tools applicable to strings won't be available in byte objects. Unless you have a good idea of why you need a byte object, always work with Python strings. If you need to perform transformations like the one in this task, encode and decode in the same line so that you keep your objects within the comfortable realm of Python strings. If you are interested in learning more about encodings, you can refer to this brief article: https://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3 and this other, longer and more detailed one: http://www.diveintopython3.net/strings.html.
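To make the effect of the errors parameter more concrete, here is a small sketch comparing the handlers that str.encode() accepts. The variable names are illustrative only; the word is the one used in the report.

```python
word = 'CASTAÑACORP'

# 'replace' substitutes a '?' for each character ASCII cannot represent
replaced = word.encode('ascii', errors='replace').decode('ascii')

# 'ignore' silently drops those characters instead
ignored = word.encode('ascii', errors='ignore').decode('ascii')

# The default handler ('strict') raises an exception
try:
    word.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding always produces a bytes object, a different type from str
as_bytes = word.encode('utf-8')

print(replaced)                  # CASTA?ACORP
print(ignored)                   # CASTAACORP
print(strict_failed)             # True
print(type(as_bytes).__name__)   # bytes
```

Encoding and decoding in the same expression, as in the first two statements, keeps the result a regular Python string.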
The next step adds an extra newline (the \n character) to all words ending with a period. This marks the different paragraphs. After that, the code creates a line and adds the words one by one. If an extra word would make it go over 80 characters, it finishes the line and starts a new one. If the line already ends with a newline, it finishes it and starts another one as well. Note that there's an extra space added to separate the words.
Finally, each line is converted to title case with .title(), and all the lines are joined into a single string.
Some other useful operations that can be performed on strings are as follows:
Strings can be sliced like any other sequence. This means that "word"[0:2] will return "wo".
Use .splitlines() to split a string into lines at newline characters.
There are .upper() and .lower() methods, which return a copy with all of the characters set to uppercase or lowercase. Their use is very similar to .title():
>>> 'UPPERCASE'.lower()
'uppercase'
For simple replacements, use .replace(). This method is useful for very simple cases, but replacements can get tricky easily. Be careful with the order of replacements to avoid collisions, and watch out for case sensitivity issues. Note the wrong replacement in the following example:
>>> 'One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.'.replace('ring', 'necklace')
'One necklace to rule them all, one necklace to find them, One necklace to bnecklace them all and in the darkness bind them.'
This is similar to the issues we'll encounter with regular expressions matching unexpected parts of your text. More examples follow later; refer to the regular expressions recipes for more information.
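The operations listed above can be tried together in a short script. This is a sketch with illustrative variable names; the last part shows one possible way to avoid the unwanted replacement inside "bring" by comparing whole words rather than substrings, reusing the split and join approach from this recipe.

```python
text = 'first line\nsecond line'

print('word'[0:2])           # wo
print(text.splitlines())     # ['first line', 'second line']
print('UPPERCASE'.lower())   # uppercase
print('lowercase'.upper())   # LOWERCASE

quote = 'One ring to rule them all, one ring to bring them all'

# .replace() works on substrings, so it also changes 'bring'
naive = quote.replace('ring', 'necklace')

# Comparing whole words leaves 'bring' untouched
safe = ' '.join('necklace' if word == 'ring' else word
                for word in quote.split())

print(naive)
print(safe)
```

The whole-word comparison has its own limits: a word with punctuation attached, such as "ring,", would not match, and rejoining with single spaces discards the original spacing and line breaks.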
To wrap text lines, you can use the textwrap module included in the standard library, instead of manually counting characters. View the documentation here: https://docs.python.org/3/library/textwrap.html.
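As a brief illustration, textwrap.fill() wraps a string to a given width and returns it as a single string with newlines. The sample text is taken from the report, and LINE_SIZE mirrors the width used in this recipe; the other names are illustrative.

```python
import textwrap

LINE_SIZE = 80

text = ('THE MAIN DRIVER OF THE SALES HAS BEEN THE NEW PACKAGE DESIGNED UNDER '
        'THE SUPERVISION OF OUR MARKETING DEPARTMENT. THE EVALUATION IS '
        'SATISFACTORY AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC.')

# fill() breaks the text at whitespace so that no line exceeds the width
wrapped = textwrap.fill(text, width=LINE_SIZE)
print(wrapped)

# wrap() does the same work but returns a list of lines instead
lines = textwrap.wrap(text, width=LINE_SIZE)
print(len(lines))
```

Note that textwrap only handles the line width; starting a new line after each period, as this recipe does, would still need to be done separately, for example by wrapping each sentence on its own.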
If you work with multiple languages, or with any kind of non-English input, it is very useful to learn the basics of Unicode and encodings. In a nutshell, given the vast number of characters in all the different languages in the world, including alphabets not related to the Latin one, such as Chinese or Arabic, there's a standard, Unicode, that tries to cover all of them so that computers can properly handle them. Python 3 greatly improved this situation by making its internal string objects able to deal with all of those characters. The default encoding that Python uses, and the most common and compatible one, is currently UTF-8.
A good article to learn about the basics of UTF-8 is this blog post: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.
Dealing with encodings is still relevant when reading external files, which can be encoded in different ways (for example, CP-1252 or windows-1252, a common encoding produced by legacy Microsoft systems, or ISO 8859-15, an ISO standard for Western European languages).
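The following sketch shows why the encoding matters when reading external data. The file name report.txt and the variable names are hypothetical; the file is written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile

# Bytes as a legacy system using CP-1252 might produce them
raw = 'CASTAÑACORP'.encode('cp1252')

# Decoding with the right encoding recovers the text
print(raw.decode('cp1252'))

# Decoding with the wrong one fails here; with other data it may
# instead silently produce the wrong characters
try:
    raw.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print(decoded_as_utf8)

# When reading a file, pass the encoding explicitly to open()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'report.txt')
    with open(path, 'wb') as fp:
        fp.write(raw)
    with open(path, encoding='cp1252') as fp:
        content = fp.read()
print(content)
```

Stating the encoding in open() avoids depending on the platform default, which can differ between systems.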