Book Image

MySQL Admin Cookbook

By : Daniel Schneller, Udo Schwedt
Book Image

MySQL Admin Cookbook

By: Daniel Schneller, Udo Schwedt

Overview of this book

<p>MySQL is the most popular open-source database and is also known for its easy set up feature. However, proper configuration beyond the default settings still is a challenge, along with some other day-to-day maintenance tasks such as backing up and restoring, performance tuning, and server monitoring. These tasks have not been covered thoroughly in the default documentation.<br /><br />This book provides both step-by-step recipes and relevant background information on these topics and more. It covers everything from basic to advanced aspects of MySQL administration and configuration. One of the things you are really going to love about this book is that all recipes are based on real-world experience and were derived from proven solutions used in an enterprise environment.<br /><br />This book shows you everything you need to know about MySQL Administration. You will learn to set up MySQL replication to manage load balancing and deal with online backup and fail-over scenarios. As you consider the benefits of backing up, you might like to back up your database efficiently with advanced techniques covered in this book.<br /><br />The book demonstrates how to create, modify, and delete indexes. You will also learn to identify duplicate indexes, which hinder your MySQL server performance. This book focuses on administration tasks and will help you as an administrator to optimize the database for efficiency and reliability.<br /><br />You will learn to manage data efficiently by inserting data in existing database content and importing and exporting databases. The sooner you learn about taking advantage of metadata from this book, the sooner you can start using the space efficiently. Get to know about managing users and assigning privileges and regaining lost administrative user credentials. Finally, learn to manage the database schema by customizing it to automate database schema evolution in the context of application updates.</p>
Table of Contents (16 chapters)
MySQL Admin Cookbook
Credits
About the Authors
About the Reviewers
Preface
Index

Choosing character sets


This recipe is not a step-by-step set of instructions to follow. Consider the information presented here as a list of topics that guide you towards a suitable configuration.

Text around the world

Whenever you store textual data in a database in CHAR, VARCHAR, or TEXT columns (so virtually always)—you have to think about character sets and collations.

English does not have any special characters apart from the usual Roman letters which makes it different from many other languages. For example, the French language requires characters like â, é, ò, German texts will likely contain things like Ä, Ö, or ß that are often forgotten when designing computer systems, unless you live in one of these countries. Of course, Chinese, Japanese, Korean, Russian, among others have to be considered as well, being based on completely different alphabets (if they are alphabetic at all).

Nevertheless, in today's global and networked systems, it has become common for a database to be used by people from around the world, all of them expecting to be able to use their native language with all its subtleties.

Character sets

Basically, a so-called character set is like a table that contains a mapping between any character a given human language makes use of and a numerical representation the computer uses to store that character internally. This concept is analogous to say the Morse code in which, instead of mapping a character to a number, it is matched to a sequence of short and long signals.

We will not dive too deeply into the theory of these character sets and their accompanying collations (how to sort and compare characters in any given language), as there is lots of information about this available elsewhere, including the MySQL online manual. For now suffice it to say that MySQL has good support for a wide variety of character sets and collations, and allows you to specify which ones to use on a per-server, per-database, per-table, and even per-column basis.

You can find all the details in the online manual's Chapter 9 on Internationalization and Localization at http://dev.mysql.com/doc/refman/5.1/en/internationalization-localization.html.

Defaults

Even though you might not be immediately aware of it, any MySQL database schema you set-up contains information about the character set to use for the tables and columns contained in it. When you do not tell it to do otherwise, it will just silently use the default settings it was shipped with.

Instead of a concrete step-by-step walkthrough style recipe, we will merely provide some checkpoints to think about when you design a database. As is often the case with complex topics, there is no one answer that perfectly fits every scenario. However, you will find advice on what basic aspects to consider when deciding on the character configuration you are going to use.

Multiple levels of configuration

Be aware that due to MySQL's flexibility you do not have to decide on a single character set for all your data, but can go for multiple sets if needed. We do, however, recommend sticking with a single character set if at all possible because matters quickly become very complicated and hard to maintain when dealing with JOINs, and different client programs, and so on.

For details on this, refer to the MySQL online manual, section 9.1.3 at http://dev.mysql.com/doc/refman/5.1/en/charset-syntax.html.

Getting ready...

To find out what character sets are available to choose from, you can either refer to the MySQL online manual at http://dev.mysql.com/doc/refman/5.1/en/charset-charsets.html or retrieve a list from your MySQL server. To get it, connect to the server and issue this command:

mysql> SHOW CHARACTER SET;

This will output a table of character sets supported by your server with their names, short descriptions, the corresponding default collation, and the number of bytes any single character will use up at most if stored in the database. We will get to this in a minute.

How to do it...

Even though we said before that there is no quick answer, you may skip reading the remainder of this recipe if you are not interested in getting too many details, but just want quick general advice that might not be an ideal solution for you, but will work well and keep your options open for the future:

Tip

For the impatient:

Use utf8 as you default encoding for all your tables.

Following this advice will enable you to store any international text in your database correctly.

If, however, you would like to know more before making a decision, please read on.

Determining required languages

The most important decision you have to make is whether you want (or have) to support contents in more than a single language. If you know for sure that a database is going to exclusively contain English words, matters are going to be rather easy. In this case, you can just use the default latin1 character set. Be aware, however, that you might want to specify a different collation from the default Swedish one.

For any other single-language content, go through the list of supported character sets and see if there is one for your language or family of languages. If in doubt, then read on.

Choosing from Unicode character sets

In case you cannot or do not want to commit yourself to a single language or family of languages, you should probably choose one of the Unicode-based character sets. Those are designed to handle many languages and mixed language content well. See below for more details on which Unicode character set is most suitable for your needs.

Deciding on a Unicode character set

MySQL supports two Unicode character sets: ucs2 (which is the predecessor to UTF-16, but still referred to with the older name in much of MySQL's documentation) and utf8. Both are suitable to store any character that has been defined in Unicode. The main difference between them is the amount of space they require for a single character. With ucs2, characters are uniformly stored as a two-byte sequence, whereas with the utf8 character set, the amount of storage required for a character depends on the individual character.

The UTF-8 encoding was designed to allow for a smooth migration from the commonplace single-byte systems to more sophisticated and internationally usable software. The basic idea was to continue representing the most often used characters of the Western languages as a single byte, just as before, and dynamically use more than one byte per character for more "unusual" (meaning non-ASCII) symbols. A well-defined mapping algorithm was designed to be able to automatically map any not-so-usual character to a two-, three- or at most four-byte-long representation.

A positive and intentional effect of this technique is that an English or German text will not require more space in utf8 (at least not significantly more) than if it were encoded in a single-byte character set like latin1. This is because there are only so many Umlauts in a German text—none in English whatsoever—meaning that the bulk of the information continues to be stored with just one byte per character.

Moreover, any existing software program that can handle regular single-byte text information will continue to work with UTF-8 encoded information, maybe just displaying non-ASCII characters incorrectly as two or more separate symbols.

The major downside is that you cannot tell the exact amount of space you need to reserve to store any UTF-8 encoded text in advance because depending on the characters in the text those requirements vary. To be on the safe side, you need to prepare for the worst case, having each and every character of a text requiring the full four bytes. For a CHAR(10)

column, 40 bytes of storage space have to be reserved to be certain that there is enough room to store any sequence of 10 characters. This is not 100-percent exact, but good enough for our purposes here. See the MySQL online manual, section 10.5 at http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html for all details on storage requirements for each data type.

With ucs2, you have the benefit of being able to tell exactly how much space you will need to store any given text, provided you know how many characters it has. This means that a column defined, as say CHAR(10) will use 20 bytes of disk space internally. For the regular Western language-based text this will usually be more than the equivalent utf8 encoded version would take, but at least you can plan in advance.

A major drawback of ucs2 (and any double-byte character set in general) is that most software products are not ready to process it because they were designed with a single byte per character in mind.

From all the information above, we recommend you use utf8 in any case where English or any Western language text will make up the bulk of the contents you are going to store in the database. This will result in the most space-efficient and yet compatible way of storing textual information while preserving any international characters.

Only when you know in advance that the bulk of your contents will be in languages different from those, most notably languages from the Middle East and the Far East, should you go with ucs2 from a space-efficiency standpoint. However, be sure you are aware of the other implications this has in terms of database client support. If in doubt, utf8 is the safest option here as well.

Considering conversion needs between server and clients

The most important benefit of using ucs2 instead of utf8 on the MySQL server is predictable space requirements. However, MySQL for some reason does not support ucs2 as a character set for returning data to any client software. What this means is that even though a character might be stored with two bytes in the InnoDB table space or MyISAM data files, the server will not just send you those two bytes when you ask for that data. Instead, it will convert it to a different encoding before sending the data across the network. Whatever the reason for this, it entails an additional burden on the server for any data entering or leaving a MySQL table.

To make matters worse from a processing-efficiency point of view, many modern systems internally use ucs2 or its successor UTF-16 anyway (Java or Windows to name just two). So in theory, those could take the data verbatim from the database server and go on processing it. Instead, MySQL will convert the data from its internal ucs2 format to, for example, utf8, prior to sending it to the client, which in turn will then often convert it right back to UTF-16.

As a consequence, we (again) recommend you use utf8 for database internal storage if none of the national character sets fit your need, to save on the conversion effort when sending data back and forth between clients and the server. Only when you really need to focus on the reduced storage space for Eastern languages, should you consider setting up your data store with ucs2.