Thursday, 13 September 2012 12:54

What MySQL Database Collation should I use?


Before you can decide which collation to use, there are a few things to understand about how MySQL collations work.

The difference between utf8_general_ci and utf8_unicode_ci

To understand the difference between utf8_general_ci and utf8_unicode_ci, we need to break down the collation's name.

  1. UTF8 - this is the character set to be used. Computers in different regions have historically used different character sets, such as latin1, each mapping its own limited range of characters to byte values. UTF8 is a character set that tries to cover all characters in one set.
  2. general_ci / unicode_ci - this is the set of comparison rules, so the question should really be "What is the difference between general_ci and unicode_ci?"
    The difference is the algorithm MySQL uses to compare and sort text, which affects searches, ORDER BY and unique indexes. The _ci suffix means both are case-insensitive. general_ci is less resource intensive but uses simplified rules, so it is less accurate than unicode_ci, which follows the Unicode Collation Algorithm. On modern web servers the performance difference is rarely a concern, so unicode_ci should be used.

So this means that:

  • utf8_general_ci collation uses the UTF8 character set with the general comparison algorithm, giving better performance but less accurate results
  • utf8_unicode_ci collation uses the UTF8 character set with the Unicode comparison algorithm, giving slightly lower performance but more accurate results

Where possible select utf8_unicode_ci (see notes below)
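To make "less accurate" concrete: the MySQL documentation uses the German letter ß as an example, which general_ci treats as equal to "s" while unicode_ci treats it as equal to "ss". The sketch below is only a toy model of that one rule in Python. The two folding tables and the function names are made up for this illustration and are not MySQL's real collation data or algorithm.

```python
# Toy folding tables, written by hand for this example only.
# They are NOT MySQL's real collation data.
GENERAL_STYLE = {"\u00df": "s"}    # one character always folds to one character
UNICODE_STYLE = {"\u00df": "ss"}   # one character may expand to several


def fold(text, table):
    """Lower-case the text, then apply the per-character folding table."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def equal_under(a, b, table):
    """Return True if the two strings compare equal under the given table."""
    return fold(a, table) == fold(b, table)


if __name__ == "__main__":
    word = "Stra\u00dfe"  # "Strasse" written with the sharp s
    print(equal_under(word, "STRASSE", GENERAL_STYLE))  # False
    print(equal_under(word, "STRASSE", UNICODE_STYLE))  # True
```

In this toy model a search for "STRASSE" only finds the spelling with ß under the unicode-style rules, which is the kind of difference that matters for names and addresses.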

The difference between UTF8MB4 and UTF8

A correct implementation of UTF8 requires up to 4 bytes per character, but MySQL's original utf8 character set (also known as utf8mb3) stores a maximum of 3 bytes per character. This is a MySQL limitation rather than a PHP one. MySQL 5.5.3 added a full implementation under the name UTF8MB4, and it has taken until recent times for servers and CMS platforms to catch up. UTF8MB4 is a superset of UTF8, so it is fully backwards compatible.

Using UTF8MB4 allows you to use all UTF8 characters including emojis and some international characters and store these in the MySQL database.
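The 3-byte limit is easy to check outside the database. The following Python sketch prints how many bytes a few sample characters occupy in UTF-8 and flags text that would need UTF8MB4. The function names utf8_length and needs_utf8mb4 are invented for this example, and the sample characters are arbitrary.

```python
def utf8_length(ch):
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """Return True if the text contains a character that takes 4 bytes in UTF-8.

    Those are exactly the code points above U+FFFF, which the 3-byte
    utf8 character set cannot store.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji
    for ch in samples:
        print(repr(ch), utf8_length(ch))   # 1, 2, 3 and 4 bytes respectively
    print(needs_utf8mb4("caf\u00e9"))            # False
    print(needs_utf8mb4("hello \U0001F600"))     # True
```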

So this means that:

  • utf8_unicode_ci collation uses the UTF8 character set with a 3 byte storage limit, which restricts the UTF8 characters that can be used
  • utf8mb4_unicode_ci collation uses the full UTF8MB4 character set with up to 4 bytes per character, allowing the use of all UTF8 characters
  • UTF8 tables and databases can be upgraded to UTF8MB4 with no data loss. The reverse will cause data loss if any 4 byte characters are stored.
  • UTF8MB4 gives better international language support because more characters can be used in one character set
  • UTF8MB4 allows the use of native emoji characters
  • UTF8 and UTF8MB4 are not language specific

Where are collations used?

There are several places where a collation is used, and they are all set separately.

  1. Server Connection Collation - The connection between PHP and the MySQL daemon
  2. Default Collation - Each database has a default collation
  3. Table Collation - Tables in the database can have a specific collation set which does not have to be the default collation.

You may have heard the expression "A chain is only as strong as its weakest link". The journey of data from PHP to a MySQL database is no different. As you can see above, there are several places where you can set a collation, and each of them needs to be able to support the data you are sending through it.

When a UTF8MB4 (4 byte) character is sent to the database and any of the steps in the chain is set to UTF8 only, the character cannot be stored. Depending on which link fails and on the server's SQL mode, MySQL will reject the statement with an "Incorrect string value" error, cut the text off at that character with only a warning, or store a ? in its place. If you send a UTF8 character down a chain that supports UTF8MB4 throughout, no data loss will occur because UTF8MB4 is backwards compatible with UTF8.
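The "weakest link" idea can be sketched as a simple check. This Python example is a simplified model that only knows about the two character sets discussed in this article. The link names and the weak_links function are made up for illustration, and the character set of each link is something you would look up yourself on your own server.

```python
# Simplified assumption: of the character sets discussed here,
# only utf8mb4 can carry 4-byte characters.
FOUR_BYTE_CAPABLE = {"utf8mb4"}


def weak_links(chain):
    """Return the names of the links that cannot carry 4-byte characters.

    `chain` maps a link name (connection, database, table) to the
    character set configured for that link.
    """
    return [
        name
        for name, charset in chain.items()
        if charset.lower() not in FOUR_BYTE_CAPABLE
    ]


if __name__ == "__main__":
    chain = {"connection": "utf8mb4", "database": "utf8mb4", "table": "utf8"}
    print(weak_links(chain))  # ['table']
```

An empty list means every link in this model can carry emoji and other 4 byte characters.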

What Collation should I use?

So now if we take what we have learned and summarise here:

  • unicode_ci gives better results than general_ci and performance is no longer an issue
  • UTF8 Databases and Tables can be upgraded to UTF8MB4 without data loss
  • UTF8MB4 is backwards compatible with UTF8
  • UTF8MB4 is a correct implementation of UTF8 and allows use of all UTF8 characters
  • Most webservers now fully support UTF8MB4
  • Most popular CMS are now pushing UTF8MB4
  • UTF8MB4 has better international language support and allows you to use emoji
  • the whole data chain (PHP --> MySQL Database) needs to support the collation you are using

Which means you should ideally use the following:

  1. Server Connection Collation - utf8mb4_unicode_ci
  2. Default Collation - utf8mb4_unicode_ci
  3. Table Collation - utf8mb4_unicode_ci (when needed)
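As a rough guide, the three settings above correspond to three MySQL statements: SET NAMES for the connection, ALTER DATABASE for the default, and ALTER TABLE ... CONVERT TO for an existing table. The Python sketch below only builds those statements as text so you can review them before running anything. The helper names and the database and table names (my_site, articles) are hypothetical, and you should back up your data before converting a table.

```python
CHARSET = "utf8mb4"
COLLATION = "utf8mb4_unicode_ci"


def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def connection_sql():
    """Statement to run right after connecting, so the connection uses utf8mb4."""
    return f"SET NAMES {CHARSET} COLLATE {COLLATION};"


def database_sql(database):
    """Statement that changes the default character set and collation of a database."""
    return (
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET {CHARSET} COLLATE {COLLATION};"
    )


def table_sql(table):
    """Statement that converts an existing table and its text columns."""
    return (
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {CHARSET} COLLATE {COLLATION};"
    )


if __name__ == "__main__":
    # my_site and articles are placeholder names for this example.
    print(connection_sql())
    print(database_sql("my_site"))
    print(table_sql("articles"))
```

Changing the database default only affects tables created afterwards, which is why existing tables need their own statement.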

In Summary

You will be using the UTF8MB4 character set for the whole chain, which fully supports all UTF8 characters (3 and 4 byte) and is backwards compatible with UTF8 (3 byte), allowing all languages to be used. The unicode_ci comparison algorithm is used for better results, as the performance hit is no longer noticeable.


Last modified on Tuesday, 08 October 2019 09:50