In this example, we'll create a simple file containing UTF-8 data that falls outside of the ASCII range. This ensures that we'll actually have some multi-byte characters to work with. To generate the test data, point your browser to http://www.translit.ru.
First, create a text file and name it russian.txt. Using the site mentioned above, generate the following text and save it in the file, making sure the file is stored as UTF-8. The file is also included in a file bundle available on the Packt FTP site.
Example UTF-8 Multibyte: Текст
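If you would rather not copy the text from a website, the same file can be produced with a short script. The following is only a sketch: the function name write_sample and the constant SAMPLE are our own, and we assume the file holds exactly the one line shown above with no trailing newline. It uses io.open so that it behaves the same way on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io

# The sample line from the text. The Cyrillic word is spelled with escapes
# so this script itself stays pure ASCII.
# Assumption: no trailing newline; your editor may add one.
SAMPLE = u"Example UTF-8 Multibyte: \u0422\u0435\u043a\u0441\u0442"


def write_sample(path="russian.txt"):
    """Write SAMPLE to `path`, encoded as UTF-8, and return the path."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    return path
```

Calling write_sample() with no arguments creates russian.txt in the current directory. The five Cyrillic characters each take two bytes in UTF-8, so the file is five bytes longer than its character count.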
Next, enter the following code and save it as utf_coding.py.
#!/usr/bin/python

with open('russian.txt', 'r') as ru:
    txt = ru.read()

# Bytes Read
print "Bytes: %d" % len(txt)

# First, we'll decode.
uc = txt.decode('utf-8')

# Chars after decode
print "Chars: %d" % len(uc)
Finally, let's run the example code. Your output should be similar to what's seen here.
(text_processing)$ python utf_coding.py
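The difference between the two counts can also be seen without a file on disk. The sketch below is an illustration rather than part of the original example: the helper byte_widths and the constant TEXT are hypothetical names, and only the Cyrillic word from russian.txt is used. Unlike utf_coding.py, which relies on the Python 2 print statement and on str.decode, this version is written to run on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The Cyrillic word from russian.txt, written with escapes (5 characters).
TEXT = u"\u0422\u0435\u043a\u0441\u0442"


def byte_widths(text):
    """Return (character, UTF-8 byte count) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


encoded = TEXT.encode("utf-8")     # Unicode text -> raw bytes
decoded = encoded.decode("utf-8")  # raw bytes -> Unicode text

print("Bytes: %d" % len(encoded))  # Bytes: 10
print("Chars: %d" % len(decoded))  # Chars: 5

# Compare an ASCII letter with the Cyrillic characters. Code points are
# printed instead of the characters to avoid terminal encoding issues.
for ch, width in byte_widths(u"A" + TEXT):
    print("U+%04X takes %d byte(s)" % (ord(ch), width))
```

The ASCII letter occupies a single byte, while each Cyrillic character occupies two. That is why the byte count reported by utf_coding.py is larger than the character count after decoding.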