Rewriting an immutable string
How can we rewrite an immutable string? We can't change individual characters inside a string:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
>>> title[8] = ''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Since this doesn't work, how do we make a change to a string?
Getting ready
Let's assume we have a string like this:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
We'd like to do two transformations:
- Remove the part up to the : character
- Replace the punctuation with _, and make all the characters lowercase
Since we can't replace characters in a string object, we have to work out some alternatives. There are several common things we can do, shown as follows:
- A combination of slicing and concatenating a string to create a new string.
- When shortening, we often use the partition() method.
- We can replace a character or a substring with the replace() method.
- We can expand the string into a list of characters, then join the list back into a single string again. This is the subject of a separate recipe, Building complex strings with a list of characters.
How to do it...
Since we can't update a string in place, we have to replace the string variable's object with each modified result. We'll use an assignment statement that looks something like this:
some_string = some_string.method()
Or we could even use an assignment like this:
some_string = some_string[:chop_here]
We'll look at a few specific variations of this general theme. We'll slice a piece of a string, we'll replace individual characters within a string, and we'll apply blanket transformations such as making the string lowercase. We'll also look at ways to remove extra _ characters that show up in our final string.
Slicing a piece of a string
Here's how we can shorten a string via slicing:
- Find the boundary:
>>> colon_position = title.index(':')
The index() method locates a particular substring and returns the position where that substring can be found. If the substring doesn't exist, it raises a ValueError exception. The following expression will always be true: title[colon_position] == ':'.
- Pick the substring:
>>> discard, post_colon = title[:colon_position], title[colon_position+1:]
>>> discard
'Recipe 5'
>>> post_colon
' Rewriting, and the Immutable String'
We've used the slicing notation to show the start:end positions of the characters to pick. We also used multiple assignment to assign two variables, discard and post_colon, from the two expressions.
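Because index() raises an exception when the substring is missing, slicing code often needs a fallback. The following is a minimal sketch of one way to handle that; the helper name text_after() is hypothetical and not part of the recipe, and returning the text unchanged is only one possible policy.

```python
def text_after(text: str, marker: str) -> str:
    """Return the part of text after the first marker.

    Hypothetical helper: if the marker is absent, the text is
    returned unchanged instead of letting ValueError propagate.
    """
    try:
        position = text.index(marker)  # raises ValueError when marker is missing
    except ValueError:
        return text
    # Slice from just past the marker to the end of the string.
    return text[position + len(marker):]


title = "Recipe 5: Rewriting, and the Immutable String"
after_colon = text_after(title, ":")
no_marker = text_after("No colon here", ":")
```

The related find() method returns -1 instead of raising an exception, which suits code that prefers to test a return value.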
We can also use partition() instead of manual slicing. It finds the boundary and splits the string in one step:
>>> pre_colon_text, _, post_colon_text = title.partition(':')
>>> pre_colon_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'
The partition() method returns three things: the part before the target, the target, and the part after the target. We used multiple assignment to assign each object to a different variable. We assigned the target to a variable named _ because we're going to ignore that part of the result. This is a common idiom for places where we must provide a variable, but we don't care about using the object.
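It can help to see what partition() does when the target is not present. The sketch below uses illustrative variable names; it shows that the method does not raise an exception, but returns the whole string followed by two empty strings.

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# Normal case: three pieces, with the separator in the middle.
before, separator, after = title.partition(":")

# Missing separator: the original string, then two empty strings.
missing = "No colon here".partition(":")

# An empty middle element is a simple signal that nothing was found.
found = bool(missing[1])
```

Checking the middle element is a simple way to tell whether the separator was found before trusting the third element.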
Updating a string with a replacement
We can use a string's replace() method to create a new string with punctuation marks replaced. When using replace() to switch punctuation marks, save the results back into the original variable, in this case post_colon_text:
>>> post_colon_text = post_colon_text.replace(' ', '_')
>>> post_colon_text = post_colon_text.replace(',', '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
This has replaced the space and comma characters with the desired _ characters. We can generalize this to work with all whitespace and punctuation. This leverages the for statement, which we'll look at in Chapter 2, Statements and Syntax.
We can iterate through all whitespace and punctuation characters:
>>> from string import whitespace, punctuation
>>> for character in whitespace + punctuation:
...     post_colon_text = post_colon_text.replace(character, '_')
...
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
As each kind of punctuation character is replaced, we assign the latest and greatest version of the string to the post_colon_text variable.
We can also use a string's translate() method for this. This relies on creating a dictionary object to map each source character's Unicode code point, as returned by ord(), to a resulting character:
>>> from string import whitespace, punctuation
>>> title = "Recipe 5: Rewriting an Immutable String"
>>> title.translate({ord(c): '_' for c in whitespace+punctuation})
'Recipe_5__Rewriting_an_Immutable_String'
We've created a mapping with {ord(c): '_' for c in whitespace+punctuation} to translate any character, c, in the whitespace+punctuation sequence of characters to the '_' character. This may have better performance than a sequence of individual character replacements.
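As an alternative to writing the dictionary by hand, the built-in str.maketrans() can build an equivalent translation table from two strings of equal length. This is a sketch of that variation rather than the recipe's own approach; the names targets and table are illustrative.

```python
from string import punctuation, whitespace

# Every character we want to turn into an underscore.
targets = whitespace + punctuation

# maketrans(x, y) maps the i-th character of x to the i-th character of y,
# so y must be as long as x.
table = str.maketrans(targets, "_" * len(targets))

title = "Recipe 5: Rewriting an Immutable String"
cleaned = title.translate(table)

# The hand-written dictionary gives the same result.
same_as_dict = cleaned == title.translate({ord(c): "_" for c in targets})
```

Building the table once and reusing it is convenient when many strings need the same treatment.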
Removing extra punctuation marks
In many cases, there are some additional steps we might follow. We often want to remove leading and trailing _ characters. We can use strip() for this:
>>> post_colon_text = post_colon_text.strip('_')
In some cases, we'll have multiple _ characters because we had multiple punctuation marks. The final step would be something like this to clean up multiple _ characters:
>>> while '__' in post_colon_text:
...     post_colon_text = post_colon_text.replace('__', '_')
...
This is yet another example of the same pattern we've been using: we replace the variable's value with a modified string. This depends on the while statement, which we'll look at in Chapter 2, Statements and Syntax.
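A different technique from the loop shown above is to collapse runs of _ with a regular expression. The sketch below uses re.sub() from the standard library and compares its result with the while loop; it is offered as an alternative, not as the recipe's method.

```python
import re

text = "_Rewriting__and_the_Immutable_String"

# Strip leading/trailing underscores, then replace any run of one or
# more underscores with a single underscore.
collapsed = re.sub(r"_+", "_", text.strip("_"))

# The loop-based version from the recipe, for comparison.
looped = text.strip("_")
while "__" in looped:
    looped = looped.replace("__", "_")
```

Both approaches produce the same string here; the regular expression does the work in a single pass, while the loop needs no extra import.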
How it works...
Technically, we can't modify a string in place: the data structure for a string is immutable. However, we can assign a new string back to the original variable. For code that refers to the string only through that variable, this technique behaves the same as modifying a string in place.
When a variable's value is replaced and nothing else refers to the previous value, that value no longer has any references and is garbage collected. We can see that a new object is involved by using the id() function to track each individual string object:
>>> id(post_colon_text)
4346207968
>>> post_colon_text = post_colon_text.replace('_','-')
>>> id(post_colon_text)
4346205488
Your actual ID numbers may be different. What's important is that the original string object assigned to post_colon_text had one ID. The new string object assigned to post_colon_text has a different ID. It's a new string object.
When the old string has no more references, it is removed from memory automatically.
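One place where reassignment visibly differs from a true in-place change is when a second variable refers to the same string. The following sketch, with illustrative variable names, shows that the other reference keeps the old object alive and unchanged.

```python
original = "_Rewriting__and_the_Immutable_String"
alias = original  # a second reference to the same string object

# Rebinding 'original' creates a new string; 'alias' still refers
# to the old object, which is therefore not garbage collected.
original = original.replace("_", "-")

same_object = alias is original
```

After this runs, alias still holds the underscore version, while original holds the hyphenated one.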
We made use of slice notation to decompose a string. A slice has two parts: [start:end]. A slice always includes the starting index. String indices always start with zero as the first item. A slice never includes the ending index.
The items in a slice have an index from start to end-1. This is sometimes called a half-open interval.
Think of a slice like this: all characters where the index i is in the range start ≤ i < end.
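The half-open rule can be checked directly. This short sketch, using illustrative names, builds the same substring by slicing and by visiting each index in range(start, end), and shows two useful consequences of the rule.

```python
title = "Recipe 5: Rewriting, and the Immutable String"
start, end = 0, 8

# The slice includes index 0 and stops before index 8 (the colon).
by_slice = title[start:end]

# The same characters, picked one index at a time: start <= i < end.
by_index = "".join(title[i] for i in range(start, end))

# Consequence 1: the length of a slice is end - start.
slice_length = len(by_slice)

# Consequence 2: adjacent slices that share a boundary rejoin exactly.
rejoined = title[:end] + title[end:]
```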
We noted briefly that we can omit the start or end indices. We can actually omit both. Here are the various options available:
- title[colon_position]: A single item, that is, the : we found using title.index(':').
- title[:colon_position]: A slice with the start omitted. It begins at the first position, index zero.
- title[colon_position+1:]: A slice with the end omitted. It ends at the end of the string, as if we said len(title).
- title[:]: Since both start and end are omitted, this is the entire string. For a mutable sequence such as a list, this is a quick and easy way to make a copy. For an immutable string, a copy is rarely needed, and CPython generally returns the same string object.
There's more...
There are more features for indexing in Python collections like a string. The normal indices start with 0 on the left. We have an alternate set of indices that use negative numbers that work from the right end of a string:
- title[-1] is the last character in the title, 'g'
- title[-2] is the next-to-last character, 'n'
- title[-6:] is the last six characters, 'String'
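Negative indices can be read as shorthand for counting back from len(title). The sketch below, with illustrative names, checks that equivalence for the examples above.

```python
title = "Recipe 5: Rewriting, and the Immutable String"

last = title[-1]
last_explicit = title[len(title) - 1]  # same position, counted from the left

tail = title[-6:]
tail_explicit = title[len(title) - 6:]

# A negative end index trims characters from the right.
without_last = title[:-1]
```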
We have a lot of ways to pick pieces and parts out of a string.
Python offers dozens of methods for working with a string. The Text Sequence Type — str section of the Python Standard Library describes the different kinds of transformations that are available to us. There are three broad categories of string methods: we can ask about the string, we can parse the string, and we can transform the string to create a new one. Methods such as isnumeric() tell us whether every character in a string is numeric.
Here's an example:
>>> 'some word'.isnumeric()
False
>>> '1298'.isnumeric()
True
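The isnumeric() method is broader than a test for the digits 0 to 9, and it does not accept signs or decimal points. The following sketch compares it with the related isdigit() and isdecimal() methods on a few sample strings chosen for illustration.

```python
samples = ["1298", "½", "-1", "1.5", ""]

# For each sample: (isnumeric, isdigit, isdecimal)
checks = {s: (s.isnumeric(), s.isdigit(), s.isdecimal()) for s in samples}

# "1298" passes all three tests.
# "½" is numeric, but it is neither a digit nor a decimal character.
# "-1", "1.5", and the empty string fail all three tests.
```

When the goal is to decide whether int() will accept a string, these methods are only a rough guide, since int() also accepts a leading sign and surrounding whitespace.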
Before doing comparisons, it can help to change a string so that it has a uniform case. It's frequently helpful to use the lower() method, assigning the result to the original variable:
>>> post_colon_text = post_colon_text.lower()
We've looked at parsing with the partition() method. We've also looked at transforming with the lower() method, as well as the replace() and translate() methods.
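The individual steps of this recipe can be combined into one function. This is a minimal sketch under some assumptions: the function name slugify_title() is hypothetical, and the function assumes the title contains a : character (without one, partition() leaves nothing after the separator and the result is an empty string).

```python
from string import punctuation, whitespace


def slugify_title(title: str) -> str:
    """Hypothetical helper combining the recipe's steps."""
    # Keep only the text after the first colon.
    _, _, text = title.partition(":")
    # Map every whitespace and punctuation character to an underscore.
    text = text.translate({ord(c): "_" for c in whitespace + punctuation})
    # Remove leading and trailing underscores.
    text = text.strip("_")
    # Collapse runs of underscores into a single underscore.
    while "__" in text:
        text = text.replace("__", "_")
    # Make all the characters lowercase.
    return text.lower()


result = slugify_title("Recipe 5: Rewriting, and the Immutable String")
```

Each statement follows the same pattern: compute a new string and assign it back to the same variable.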
See also
- We'll look at the string as list technique for modifying a string in the Building complex strings from lists of characters recipe.
- Sometimes, we have data that's only a stream of bytes. In order to make sense of it, we need to convert it into characters. That's the subject of the Decoding bytes – how to get proper characters from some bytes recipe.