A big keyboard might have almost 100 individual keys. Fewer than 50 of these are letters, numbers and punctuation. At least a dozen are function keys that do things other than simply insert letters into a document. Some of the keys are different kinds of modifiers that are meant to be used in conjunction with another key—we might have Shift, Ctrl, Option, and Command.
Most operating systems will accept simple key combinations that create about 100 or so characters. More elaborate key combinations may create another 100 or so less popular characters. This isn't even close to covering the well over 100,000 characters from the world's alphabets. And there are icons, emoticons, and dingbats galore in our computer fonts. How do we get to all of those glyphs?
Python works in Unicode. The Unicode standard defines well over 100,000 individual characters, with room in its numbering scheme for more than a million.
We can see all the available characters at https://en.wikipedia.org/wiki/List_of_Unicode_characters and also http://www.unicode.org/charts/.
We'll need the Unicode character number. We might also want the Unicode character name.
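As a minimal sketch of finding both pieces of information, the standard library's unicodedata module can translate between a character, its number, and its name. The variable names are illustrative, and DIE FACE-1 is only a sample character.

```python
import unicodedata

# Look up a character by its official Unicode name.
die = unicodedata.lookup("DIE FACE-1")

# ord() gives the character number; format it in the usual U+XXXX style.
number = ord(die)
label = f"U+{number:04X}"

# unicodedata.name() goes the other way, from character to name.
name = unicodedata.name(die)

print(label, name)
```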
A given font on our computer may not be designed to provide glyphs for all of those characters. In particular, Windows computer fonts may have trouble displaying some of these characters. Using the Windows command to change to code page 65001 is sometimes necessary:
chcp 65001
Linux and Mac OS X rarely have problems with Unicode characters.
Python uses escape sequences to extend the ordinary characters we can type to cover the vast space of Unicode characters. The escape sequences start with a \ character. The next character tells exactly how the Unicode character will be represented. Locate the character that's needed. Get the name or the number. The numbers are always given in hexadecimal, base 16. They're often written as U+2680. The name might be DIE FACE-1. Use \unnnn with exactly four hexadecimal digits, padding shorter numbers with leading zeros. Or use \N{name} with the spelled-out name. If the number is more than four digits, use \Unnnnnnnn with the number padded out to eight digits.
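The three escape forms might look like the following sketch. The variable names are illustrative, and U+1F600 (GRINNING FACE) is an assumed example of a number with more than four digits. The results are printed as comparisons rather than as glyphs, so the output does not depend on the terminal's font or code page.

```python
# Four hexadecimal digits after \u: the character U+2680.
by_number = "\u2680"

# The spelled-out name inside \N{...} names the same character.
by_name = "\N{DIE FACE-1}"

# More than four digits: pad to eight digits after \U (U+1F600).
long_form = "\U0001F600"

print(by_number == by_name)
print(hex(ord(long_form)))
```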
With these escapes, we can include a wide variety of characters in Python output. To place a \ character in the string, we need to use \\. For example, we might need this for Windows filenames.
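A short sketch of the doubled backslash, using a hypothetical Windows path chosen only for illustration:

```python
# A hypothetical Windows path; each \\ in the literal is one \ in the string.
path = "C:\\Users\\example\\notes.txt"

# A raw string spells the same text without doubling the backslashes.
raw_path = r"C:\Users\example\notes.txt"

print(path)
print(path == raw_path)

# The two-character escape \\ produces a single character.
print(len("\\"))
```

Without the doubling or the r prefix, the \U in this particular path would be read as the start of an eight-digit escape.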
Python uses Unicode internally. The 128 or so characters we can type directly using the keyboard all have handy internal Unicode numbers.
When we write:
'HELLO'
Python treats it as shorthand for this:
'\u0048\u0045\u004c\u004c\u004f'
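One way to confirm that the two spellings are the same string is to compare them and list each character's number. This is a small sketch with illustrative variable names.

```python
plain = 'HELLO'
escaped = '\u0048\u0045\u004c\u004c\u004f'

# Both literals build exactly the same five-character string.
print(plain == escaped)

# ord() shows the Unicode number behind each typed character.
numbers = [f"U+{ord(c):04X}" for c in plain]
print(numbers)
```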
Once we get beyond the characters on our keyboards, the many thousands of remaining characters are identified only by their number or name.
When the string is being compiled by Python, the \uxxxx, \Uxxxxxxxx, and \N{name} are all replaced by the proper Unicode character. If we have something syntactically wrong, for example \N{name with no closing }, we'll get an immediate error from Python's internal syntax checking.
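The syntax check can be observed without breaking a script by handing the faulty literal to the built-in compile() function as source text. This is a sketch under that assumption; the names bad_source, good_source, and caught are illustrative.

```python
# Source text for a string literal whose \N{...} escape is never closed.
bad_source = r'"\N{DIE FACE-1"'

try:
    compile(bad_source, "<example>", "eval")
except SyntaxError as error:
    caught = error
else:
    caught = None

print(type(caught).__name__)

# The corrected literal compiles and evaluates to the expected character.
good_source = r'"\N{DIE FACE-1}"'
value = eval(compile(good_source, "<example>", "eval"))
print(value == "\u2680")
```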
Back in the String parsing with regular expressions recipe, we noted that regular expressions use a lot of \ characters and we specifically do not want Python's normal compiler to touch them; we used the r' prefix on a regular expression string to prevent the \ from being treated as an escape and possibly converted to something else.
What if we need to use Unicode in a regular expression? One way is to use \\ all over the place in the regular expression. We might see this: '\\w+[\u2680\u2681\u2682\u2683\u2684\u2685]\\d+'. We skipped the r' prefix on the string. We doubled up the \ used for regular expressions. We used \uxxxx for the Unicode characters that are part of the pattern. Python's internal compiler will replace the \uxxxx with Unicode characters and the \\ with a single \ internally.
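The doubled-backslash pattern can be tried out as in the sketch below. The sample text is an assumption made for illustration. The raw-string variant relies on the re module itself understanding \uxxxx escapes in string patterns, which it has done since Python 3.3.

```python
import re

# No r prefix: Python turns \\ into \ and \uxxxx into the die-face characters.
doubled = '\\w+[\u2680\u2681\u2682\u2683\u2684\u2685]\\d+'
pattern = re.compile(doubled)

# Illustrative sample: a word, the die face U+2683, then a digit.
text = "roll" + "\u2683" + "3"
match = pattern.fullmatch(text)
print(match is not None)

# Alternative: a raw string, leaving the \u escapes for the re module to read.
raw_pattern = re.compile(r'\w+[\u2680-\u2685]\d+')
print(raw_pattern.fullmatch(text) is not None)
```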
Note
When we look at a string at the >>> prompt, Python will display the string in its canonical form. Python prefers to use the ' as a delimiter even though we can use either ' or " for a string delimiter. Python doesn't generally display raw strings; instead, it puts all of the necessary escape sequences back into the string:
>>> r"\w+"
'\\w+'
We provided a string in raw form. Python displayed it in canonical form.
- In the Encoding strings – creating ASCII and UTF-8 bytes and the Decoding Bytes - How to get proper characters from some bytes recipes, we'll look at how Unicode characters are converted to sequences of bytes so we can write them to a file. We'll also look at how bytes from a file (or downloaded from a website) are turned into Unicode characters so they can be processed.
- If you're interested in history, you can read up on ASCII, EBCDIC, and other old-fashioned character codes here: http://www.unicode.org/charts/.