Text processing often turns up text in a nonstandard character encoding. Ideally, all text would be ASCII or UTF-8, but that's just not the reality. When you have text that is neither ASCII nor UTF-8 and you don't know what its character encoding is, you'll need to detect the encoding and convert the text to a standard encoding before doing further processing.
You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.
Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict containing two keys: confidence and encoding. The confidence value is a probability of how confident charade is that the value for encoding is...
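As a rough illustration of the detection call described above, here is a minimal sketch of what such a wrapper might look like. It is not the actual contents of encoding.py, which are not shown here. The decode_with helper, its min_confidence threshold, the fallback to the chardet module, and the sample result dict are all assumptions made for illustration.

```python
# Hypothetical sketch; the real encoding.py may differ.
try:
    import charade as detector          # the module named in the text
except ImportError:
    try:
        import chardet as detector      # assumption: offers the same detect() call
    except ImportError:
        detector = None                 # detection unavailable without either module


def detect(data):
    """Return the detector's guess: a dict with 'encoding' and 'confidence'."""
    if detector is None:
        raise RuntimeError("install charade (or chardet) to detect encodings")
    if isinstance(data, str):
        data = data.encode()            # the detector works on bytes, not text
    return detector.detect(data)


def decode_with(data, result, min_confidence=0.5):
    """Decode bytes using a detection result (hypothetical helper).

    Returns None when no encoding was found or the confidence is below
    min_confidence, so the caller can decide how to handle a weak guess.
    """
    encoding = result.get("encoding")
    if encoding is None or result.get("confidence", 0.0) < min_confidence:
        return None
    return data.decode(encoding)


# Bytes for "café" in Latin-1, paired with a hand-written result dict
# shaped like the one described above (the numbers are made up).
sample = b"caf\xe9"
guess = {"encoding": "ISO-8859-1", "confidence": 0.73}
text = decode_with(sample, guess)
```

With charade installed, decode_with(sample, detect(sample)) would combine the two steps. Checking the confidence before decoding is a design choice in this sketch: a low-confidence guess is treated as "unknown" rather than decoded into possibly garbled text.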