Normalizing text
In many cases, a single word can be written in multiple ways. For example, users who wrote "Über" and "Uber" probably meant the same word. If you were implementing a feature such as tagging for a blog, you certainly wouldn't want to end up with two different tags for what is really the same word.
So, before saving your tags, you might want to normalize them to plain ASCII characters, so that all the spellings end up being treated as the same tag.
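For comparison, a common quick way to get plain ASCII is to decompose the text with unicodedata.normalize and drop whatever still isn't ASCII; note that this silently throws away characters that have no decomposition at all (such as 'ß' or '€'), which is why this recipe builds a translation map instead. A minimal sketch (ascii_fold is just an illustrative name, not part of the recipe):

import unicodedata

def ascii_fold(text):
    # Decompose accented characters into base letter + combining mark,
    # then drop everything that is not plain ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(ascii_fold('Über'))    # Uber
print(ascii_fold('garçon'))  # garcon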
How to do it...
What we need is a translation map that converts all accented characters to their plain representation:
import unicodedata, sys

class unaccented_map(dict):
    # Translation table for str.translate: maps the code point of an
    # accented character to the code point of its unaccented base letter.
    def __missing__(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        de = unicodedata.decomposition(chr(key))
        if de:
            try:
                # The first entry of the decomposition is the base character,
                # expressed as a hexadecimal code point.
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                ch = key
        else:
            # No decomposition available: keep the character as it is.
            ch = key
        self[key] = ch  # Cache the result for the next lookup.
        return ch

unaccented_map = unaccented_map()
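The map relies on unicodedata.decomposition, which returns the canonical decomposition of a character as space-separated hexadecimal code points, the first of which is the unaccented base letter. You can check this yourself in the interpreter:

>>> import unicodedata
>>> unicodedata.decomposition('Ü')
'0055 0308'
>>> int('0055', 16) == ord('U')
True
>>> unicodedata.decomposition('a')   # plain letters have no decomposition
''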
Then we can apply it to any word to normalize it:
>>> 'Über'.translate(unaccented_map)
'Uber'
>>> 'garçon'.translate(unaccented_map)
'garcon'
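Coming back to the blog-tagging example, a small wrapper (normalize_tag is just an illustrative name, not part of the recipe) could fold both case and accents before a tag is saved, so that "Über" and "uber" collapse into a single tag:

def normalize_tag(tag):
    # Strip accents with the translation map built above, then fold case.
    return tag.translate(unaccented_map).lower()

>>> normalize_tag('Über') == normalize_tag('uber')
True
>>> normalize_tag('Garçon')
'garcon'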