This recipe is not a step-by-step set of instructions to follow. Consider the information presented here as a list of topics that guide you towards a suitable configuration.
Whenever you store textual data in a database in CHAR, VARCHAR, or TEXT columns (so virtually always), you have to think about character sets and collations.
English does not use any special characters beyond the basic Roman letters, which sets it apart from many other languages. For example, French requires characters like â, é, and ô, and German texts will likely contain Ä, Ö, or ß, characters that are often forgotten when designing computer systems unless you live in one of those countries. Of course, Chinese, Japanese, Korean, Russian, and others have to be considered as well, being based on completely different alphabets (if they are alphabetic at all).
Nevertheless, in today's global and networked systems, it has become common for a database to be used by people from around the world, all of them expecting to be able to use their native language with all its subtleties.
Basically, a so-called character set is a table that maps every character a given human language uses to the numerical representation the computer stores internally. The concept is analogous to, say, Morse code, in which a character is matched to a sequence of short and long signals instead of a number.
We will not dive too deeply into the theory of these character sets and their accompanying collations (how to sort and compare characters in any given language), as there is lots of information about this available elsewhere, including the MySQL online manual. For now suffice it to say that MySQL has good support for a wide variety of character sets and collations, and allows you to specify which ones to use on a per-server, per-database, per-table, and even per-column basis.
You can find all the details in the online manual's Chapter 9 on Internationalization and Localization at http://dev.mysql.com/doc/refman/5.1/en/internationalization-localization.html.
Even though you might not be immediately aware of it, any MySQL database schema you set up contains information about the character set to use for the tables and columns in it. When you do not tell it otherwise, MySQL silently uses the default settings it was shipped with.
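To see which defaults your server is currently using, you can query the relevant system variables and inspect an existing table's definition (the table name below is only an example):

```sql
-- Show the server-wide and connection-level character set defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Inspect the character set actually used by an existing table;
-- replace "mytable" with one of your own tables
SHOW CREATE TABLE mytable;
```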
Instead of a concrete step-by-step walkthrough-style recipe, we will merely provide some checkpoints to think about when you design a database. As is often the case with complex topics, there is no one answer that perfectly fits every scenario. However, you will find advice on what basic aspects to consider when deciding on the character configuration you are going to use.
Be aware that due to MySQL's flexibility you do not have to decide on a single character set for all your data, but can use multiple sets if needed. We do, however, recommend sticking with a single character set if at all possible, because matters quickly become complicated and hard to maintain when dealing with JOINs, different client programs, and so on.
For details on this, refer to the MySQL online manual, section 9.1.3 at http://dev.mysql.com/doc/refman/5.1/en/charset-syntax.html.
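As a sketch of this flexibility, character sets can be declared at the database, table, and column level; the database, table, and column names here are made up purely for illustration:

```sql
-- Database-level default character set
CREATE DATABASE sampledb DEFAULT CHARACTER SET utf8;

-- Table-level default overriding the database default,
-- plus a per-column override (hypothetical schema)
CREATE TABLE sampledb.customers (
  id INT PRIMARY KEY,
  name VARCHAR(100),                -- inherits the table default (latin1)
  comments TEXT CHARACTER SET utf8  -- per-column override
) DEFAULT CHARACTER SET latin1;
```

Mixing sets like this is exactly what quickly becomes hard to maintain, which is why a single set is usually preferable.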
To find out what character sets are available to choose from, you can either refer to the MySQL online manual at http://dev.mysql.com/doc/refman/5.1/en/charset-charsets.html or retrieve a list from your MySQL server. To get it, connect to the server and issue this command:
mysql> SHOW CHARACTER SET;
This will output a table of character sets supported by your server with their names, short descriptions, the corresponding default collation, and the number of bytes any single character will use up at most if stored in the database. We will get to this in a minute.
Even though we said before that there is no quick answer, you may skip the remainder of this recipe if you do not want all the details, but just want general advice that, while perhaps not ideal for your situation, will work well and keep your options open for the future:
Tip
For the impatient: use utf8 as your default encoding for all your tables. Following this advice will enable you to store any international text in your database correctly.
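Putting that advice into practice takes a single statement at creation time (the database and table names are illustrative):

```sql
-- New databases: make utf8 the default for all contained tables
CREATE DATABASE mydb DEFAULT CHARACTER SET utf8;

-- Existing tables can be converted as well; note that this
-- rewrites the stored data and may take a while on large tables
ALTER TABLE mydb.mytable CONVERT TO CHARACTER SET utf8;
```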
If, however, you would like to know more before making a decision, please read on.
The most important decision you have to make is whether you want (or have) to support content in more than a single language. If you know for sure that a database is going to contain exclusively English words, matters are rather easy. In this case, you can just use the default latin1 character set. Be aware, however, that you might want to specify a collation different from the default Swedish one (latin1_swedish_ci).
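For instance, an English-only table could keep latin1 but switch to a more neutral collation (the table and columns here are made up for illustration):

```sql
-- latin1 storage with a general-purpose collation
-- instead of the default latin1_swedish_ci
CREATE TABLE articles (
  id INT PRIMARY KEY,
  title VARCHAR(200)
) CHARACTER SET latin1 COLLATE latin1_general_ci;
```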
For any other single-language content, go through the list of supported character sets and see if there is one for your language or family of languages. If in doubt, then read on.
In case you cannot or do not want to commit yourself to a single language or family of languages, you should probably choose one of the Unicode-based character sets. Those are designed to handle many languages and mixed language content well. See below for more details on which Unicode character set is most suitable for your needs.
MySQL supports two Unicode character sets: ucs2 (which is the predecessor to UTF-16, but is still referred to by the older name in much of MySQL's documentation) and utf8. Both are suitable for storing any character defined in Unicode. The main difference between them is the amount of space they require for a single character. With ucs2, characters are uniformly stored as a two-byte sequence, whereas with the utf8 character set, the amount of storage required depends on the individual character.
The UTF-8 encoding was designed to allow for a smooth migration from the commonplace single-byte systems to more sophisticated and internationally usable software. The basic idea was to continue representing the most often used characters of the Western languages as a single byte, just as before, and dynamically use more than one byte per character for more "unusual" (meaning non-ASCII) symbols. A well-defined mapping algorithm was designed to be able to automatically map any not-so-usual character to a two-, three- or at most four-byte-long representation.
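You can observe this variable-length behavior directly. Assuming the connection character set is utf8, LENGTH() counts bytes while CHAR_LENGTH() counts characters:

```sql
-- With a utf8 connection:
-- 'abc' needs one byte per character,
-- 'Äöß' needs two bytes per character in UTF-8
SELECT LENGTH('abc'), CHAR_LENGTH('abc');   -- 3 bytes, 3 characters
SELECT LENGTH('Äöß'), CHAR_LENGTH('Äöß');   -- 6 bytes, 3 characters
```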
A positive and intentional effect of this technique is that an English or German text will not require significantly more space in utf8 than if it were encoded in a single-byte character set like latin1. This is because there are only so many umlauts in a German text (and none at all in English), meaning that the bulk of the information continues to be stored with just one byte per character.
Moreover, any existing software program that can handle regular single-byte text information will continue to work with UTF-8 encoded information, maybe just displaying non-ASCII characters incorrectly as two or more separate symbols.
The major downside is that you cannot tell in advance exactly how much space is needed to store a UTF-8 encoded text, because the requirements vary with the characters it contains. To be on the safe side, you need to prepare for the worst case, with each and every character requiring the full four bytes. For a CHAR(10) column, 40 bytes of storage space have to be reserved to be certain that there is enough room to store any sequence of 10 characters. This is not 100-percent exact, but good enough for our purposes here. See the MySQL online manual, section 10.5 at http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html for all details on storage requirements for each data type.
With ucs2, you have the benefit of being able to tell exactly how much space you will need to store any given text, provided you know how many characters it has. This means that a column defined as, say, CHAR(10) will use 20 bytes of disk space internally. For regular Western-language text this will usually be more than the equivalent utf8 encoded version would take, but at least you can plan in advance.
A major drawback of ucs2 (and any double-byte character set in general) is that most software products are not ready to process it, because they were designed with a single byte per character in mind.
Based on all the information above, we recommend you use utf8 whenever English or other Western-language text will make up the bulk of the content you are going to store in the database. This is the most space-efficient and most compatible way of storing textual information while preserving any international characters.
Only when you know in advance that the bulk of your content will be in other languages, most notably those of the Middle East and the Far East, should you go with ucs2 from a space-efficiency standpoint. However, be sure you are aware of the other implications this has in terms of database client support. If in doubt, utf8 is the safest option here as well.
The most important benefit of using ucs2 instead of utf8 on the MySQL server is predictable space requirements. However, MySQL for some reason does not support ucs2 as a character set for returning data to client software. This means that even though a character might be stored with two bytes in the InnoDB table space or MyISAM data files, the server will not simply send you those two bytes when you ask for the data. Instead, it will convert it to a different encoding before sending it across the network. Whatever the reason for this, it places an additional burden on the server for any data entering or leaving a MySQL table.
To make matters worse from a processing-efficiency point of view, many modern systems internally use ucs2 or its successor UTF-16 anyway (Java and Windows, to name just two). In theory, those could take the data verbatim from the database server and go on processing it. Instead, MySQL will convert the data from its internal ucs2 format to, for example, utf8 before sending it to the client, which in turn will then often convert it right back to UTF-16.
As a consequence, we (again) recommend you use utf8 for database-internal storage if none of the national character sets fits your needs, to save on the conversion effort when sending data back and forth between clients and the server. Only when you really need the reduced storage space for Eastern languages should you consider setting up your data store with ucs2.