character set support - pearson educationcharacter_set_databaseand collation_databasesystem...

30
3 Character Set Support Improved support for character set handling was added to MySQL in Version 4.1. The fea- tures described here are as implemented in MySQL 4.1.1. (MySQL 4.1.0 has some but not all of these features, and some of them are implemented differently.) This chapter discusses the following topics: n What are character sets and collations? n The multiple-level default system n New syntax in MySQL 4.1 n Affected functions and operations n Unicode support n The meaning of each individual character set and collation Character set support currently is included in the MySISAM, MEMORY (HEAP), and (as of MySQL 4.1.2) InnoDB storage engines. The ISAM storage engine does not include character set support; there are no plans to change this, because ISAM is deprecated. 3.1 Character Sets and Collations in General A character set is a set of symbols and encodings. A collation is a set of rules for compar- ing characters in a character set. Let’s make the distinction clear with an example of an imaginary character set. Suppose that we have an alphabet with four letters: ‘A’, ‘B’, ‘a’, ‘b’. We give each letter a number: ‘A’ = 0, ‘B’ = 1, ‘a’ = 2, ‘c’ = 3. The letter ‘A’ is a symbol, the number 0 is the encod- ing for ‘A’, and the combination of all four letters and their encodings is a character set. Now, suppose that we want to compare two string values, ‘A’ and ‘B’. The simplest way to do this is to look at the encodings: 0 for ‘A’ and 1 for ‘B’. Because 0 is less than 1, we say ‘A’ is less than ‘B’. Now, what we’ve just done is apply a collation to our character set. The colla- tion is a set of rules (only one rule in this case): “compare the encodings.” We call this sim- plest of all possible collations a binary collation.

Upload: others

Post on 21-Apr-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

3Character Set Support

Improved support for character set handling was added to MySQL in Version 4.1. The fea-tures described here are as implemented in MySQL 4.1.1. (MySQL 4.1.0 has some but notall of these features, and some of them are implemented differently.)

This chapter discusses the following topics:

n What are character sets and collations?n The multiple-level default systemn New syntax in MySQL 4.1n Affected functions and operationsn Unicode supportn The meaning of each individual character set and collation

Character set support currently is included in the MySISAM, MEMORY (HEAP), and (as ofMySQL 4.1.2) InnoDB storage engines. The ISAM storage engine does not include characterset support; there are no plans to change this, because ISAM is deprecated.

3.1 Character Sets and Collations in GeneralA character set is a set of symbols and encodings. A collation is a set of rules for compar-ing characters in a character set. Let’s make the distinction clear with an example of animaginary character set.

Suppose that we have an alphabet with four letters: ‘A’, ‘B’, ‘a’, ‘b’. We give each letter anumber: ‘A’ = 0, ‘B’ = 1, ‘a’ = 2, ‘c’ = 3. The letter ‘A’ is a symbol, the number 0 is the encod-ing for ‘A’, and the combination of all four letters and their encodings is a character set.

Now, suppose that we want to compare two string values, ‘A’ and ‘B’. The simplest way to dothis is to look at the encodings: 0 for ‘A’ and 1 for ‘B’. Because 0 is less than 1, we say ‘A’ isless than ‘B’. Now, what we’ve just done is apply a collation to our character set. The colla-tion is a set of rules (only one rule in this case): “compare the encodings.” We call this sim-plest of all possible collations a binary collation.

04 6337 ch03 6/24/04 12:44 PM Page 81

Page 2: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

82 CHAPTER 3 Character Set Support

But what if we want to say that the lowercase and uppercase letters are equivalent? Then wewould have at least two rules: (1) treat the lowercase letters ‘a’ and ‘b’ as equivalent to ‘A’ and‘B’; (2) then compare the encodings. We call this a case-insensitive collation. It’s a littlemore complex than a binary collation.

In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alpha-bets, sometimes multiple alphabets or eastern writing systems with thousands of characters,along with many special symbols and punctuation marks. Also in real life, most collationshave many rules: not just case insensitivity but also accent insensitivity (an “accent” is a markattached to a character as in German ‘ö’) and multiple-character mappings (such as the rulethat ‘ö’ = ‘OE’ in one of the two German collations).

MySQL 4.1 can do these things for you:

n Store strings using a variety of character setsn Compare strings using a variety of collationsn Mix strings with different character sets or collations in the same server, the same data-

base, or even the same tablen Allow specification of character set and collation at any level

In these respects, not only is MySQL 4.1 far more flexible than MySQL 4.0, it also is farahead of other DBMSs. However, to use the new features effectively, you will need to learnwhat character sets and collations are available, how to change their defaults, and what thevarious string operators do with them.

3.2 Character Sets and Collations in MySQLThe MySQL server can support multiple character sets. To list the available character sets,use the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+

| Charset | Description | Default collation |

+----------+-----------------------------+---------------------+

| big5 | Big5 Traditional Chinese | big5_chinese_ci |

| dec8 | DEC West European | dec8_swedish_ci |

| cp850 | DOS West European | cp850_general_ci |

| hp8 | HP West European | hp8_english_ci |

| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |

| latin1 | ISO 8859-1 West European | latin1_swedish_ci |

| latin2 | ISO 8859-2 Central European | latin2_general_ci |

...

04 6337 ch03 6/24/04 12:44 PM Page 82

Page 3: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

833.2 Character Sets and Collations in MySQL

The output actually includes another column that is not shown so that the example fits bet-ter on the page.

Any given character set always has at least one collation. It may have several collations.

To list the collations for a character set, use the SHOW COLLATION statement. For example, tosee the collations for the latin1 (“ISO-8859-1 West European”) character set, use this state-ment to find those collation names that begin with latin1:

mysql> SHOW COLLATION LIKE ‘latin1%’;

+-------------------+---------+----+---------+----------+---------+

| Collation | Charset | Id | Default | Compiled | Sortlen |

+-------------------+---------+----+---------+----------+---------+

| latin1_german1_ci | latin1 | 5 | | | 0 |

| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |

| latin1_danish_ci | latin1 | 15 | | | 0 |

| latin1_german2_ci | latin1 | 31 | | Yes | 2 |

| latin1_bin | latin1 | 47 | | Yes | 1 |

| latin1_general_ci | latin1 | 48 | | | 0 |

| latin1_general_cs | latin1 | 49 | | | 0 |

| latin1_spanish_ci | latin1 | 94 | | | 0 |

+-------------------+---------+----+---------+----------+---------+

The latin1 collations have the following meanings:

Collation Meaning

latin1_bin Binary according to latin1 encoding

latin1_danish_ci Danish/Norwegian

latin1_general_ci Multilingual

latin1_general_cs Multilingual, case sensitive

latin1_german1_ci German DIN-1

latin1_german2_ci German DIN-2

latin1_spanish_ci Modern Spanish

latin1_swedish_ci Swedish/Finnish

Collations have these general characteristics:

n Two different character sets cannot have the same collation.n Each character set has one collation that is the default collation. For example, the default

collation for latin1 is latin1_swedish_ci.n There is a convention for collation names: They start with the name of the character

set with which they are associated, they usually include a language name, and they endwith _ci (case insensitive), _cs (case sensitive), _bin (binary), or _uca (UnicodeCollation Algorithm, http://www.unicode.org/reports/tr10/).

04 6337 ch03 6/24/04 12:44 PM Page 83

Page 4: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

84 CHAPTER 3 Character Set Support

3.3 Determining the Default Character Set andCollationThere are default settings for character sets and collations at four levels: server, database,table, and connection. The following description may appear complex, but it has been foundin practice that multiple-level defaulting leads to natural and obvious results.

3.3.1 Server Character Set and CollationThe MySQL Server has a server character set and a server collation, which may not be null.

MySQL determines the server character set and server collation thus:

n According to the option settings in effect when the server startsn According to the values set at runtime

At the server level, the decision is simple. The server character set and collation depend initial-ly on the options that you use when you start mysqld. You can use --default-character-setfor the character set, and along with it you can add --default-collation for the collation. Ifyou don’t specify a character set, that is the same as saying --default-character-set=latin1. If you specify only a character set (for example, latin1) but not a collation, that is the same as saying --default-charset=latin1 --default-collation=latin1_swedish_ci becauselatin1_swedish_ci is the default collation for latin1. Therefore, the following three com-mands all have the same effect:

shell> mysqld

shell> mysqld --default-character-set=latin1

shell> mysqld --default-character-set=latin1 \

--default-collation=latin1_swedish_ci

One way to change the settings is by recompiling. If you want to change the default servercharacter set and collation when building from sources, use: --with-charset and --with-collation as arguments for configure. For example:

shell> ./configure --with-charset=latin1

Or:

shell> ./configure --with-charset=latin1 \

--with-collation=latin1_german1_ci

Both mysqld and configure verify that the character set/collation combination is valid. Ifnot, each program displays an error message and terminates.

The current server character set and collation are available as the values of thecharacter_set_server and collation_server system variables. These variables can bechanged at runtime.

04 6337 ch03 6/24/04 12:44 PM Page 84

Page 5: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

853.3 Determining the Default Character Set and Collation

3.3.2 Database Character Set and CollationEvery database has a database character set and a database collation, which may not be null.The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifyingthe database character set and collation:

CREATE DATABASE db_name

[[DEFAULT] CHARACTER SET charset_name]

[[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name

[[DEFAULT] CHARACTER SET charset_name]

[[DEFAULT] COLLATE collation_name]

Example:

CREATE DATABASE db_name

DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation thus:

n If both CHARACTER SET X and COLLATE Y were specified, then character set X and colla-tion Y.

n If CHARACTER SET X was specified without COLLATE, then character set X and its defaultcollation.

n Otherwise, the server character set and server collation.

MySQL’s CREATE DATABASE ... DEFAULT CHARACTER SET ... syntax is analogous to the stan-dard SQL CREATE SCHEMA ... CHARACTER SET ... syntax. Because of this, it is possible tocreate databases with different character sets and collations on the same MySQL server.

The database character set and collation are used as default values if the table character setand collation are not specified in CREATE TABLE statements. They have no other purpose.

The character set and collation for the default database are available as the values of thecharacter_set_database and collation_database system variables. The server sets thesevariables whenever the default database changes. If there is no default database, the variableshave the same value as the corresponding server-level variables, character_set_server andcollation_server.

04 6337 ch03 6/24/04 12:44 PM Page 85

Page 6: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

86 CHAPTER 3 Character Set Support

3.3.3 Table Character Set and CollationEvery table has a table character set and a table collation, which may not be null. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the tablecharacter set and collation:

CREATE TABLE tbl_name (column_list)

[DEFAULT CHARACTER SET charset_name [COLLATE collation_name]]

ALTER TABLE tbl_name

[DEFAULT CHARACTER SET charset_name] [COLLATE collation_name]

Example:

CREATE TABLE t1 ( ... )

DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation thus:

n If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.

n If CHARACTER SET X was specified without COLLATE, then character set X and its defaultcollation.

n Otherwise, the database character set and collation.

The table character set and collation are used as default values if the column character setand collation are not specified in individual column definitions. The table character set andcollation are MySQL extensions; there are no such things in standard SQL.

3.3.4 Column Character Set and CollationEvery “character” column (that is, a column of type CHAR, VARCHAR, or TEXT) has a columncharacter set and a column collation, which may not be null. Column definition syntax hasoptional clauses for specifying the column character set and collation:

col_name {CHAR | VARCHAR | TEXT} (col_length)

[CHARACTER SET charset_name [COLLATE collation_name]]

Example:

CREATE TABLE Table1

(

column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci

);

04 6337 ch03 6/24/04 12:44 PM Page 86

Page 7: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

873.3 Determining the Default Character Set and Collation

MySQL chooses the column character set and collation thus:

n If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.n If CHARACTER SET X was specified without COLLATE, then character set X and its default

collation.n Otherwise, the table character set and collation.

The CHARACTER SET and COLLATE clauses are standard SQL.

3.3.5 Examples of Character Set and Collation AssignmentThe following examples show how MySQL determines default character set and collationvalues.

Example 1: Table + Column DefinitionCREATE TABLE t1

(

c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci

) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

Here we have a column with a latin1 character set and a latin1_german1_ci collation. Thedefinition is explicit, so that’s straightforward. Notice that there’s no problem storing alatin1 column in a latin2 table.

Example 2: Table + Column DefinitionCREATE TABLE t1

(

c1 CHAR(10) CHARACTER SET latin1

) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a latin1 character set and a default collation. Now,although it might seem natural, the default collation is not taken from the table level.Instead, because the default collation for latin1 is always latin1_swedish_ci, column c1 willhave a collation of latin1_swedish_ci (not latin1_danish_ci).

Example 3: Table + Column DefinitionCREATE TABLE t1

(

c1 CHAR(10)

) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance,MySQL looks up to the table level for inspiration in determining the column character set andcollation. So, the character set for column c1 is latin1 and its collation is latin1_danish_ci.

04 6337 ch03 6/24/04 12:44 PM Page 87

Page 8: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

88 CHAPTER 3 Character Set Support

Example 4: Database + Table + Column DefinitionCREATE DATABASE d1

DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;

USE d1;

CREATE TABLE t1

(

c1 CHAR(10)

);

We create a column without specifying its character set and collation. We’re also not speci-fying a character set and a collation at the table level. In this circumstance, MySQL looks upto the database level for inspiration. (The database’s settings become the table’s settings, andthereafter become the column’s setting.) So, the character set for column c1 is latin2 and itscollation is latin2_czech_ci.

3.3.6 Connection Character Sets and CollationsSeveral character set and collation system variables relate to a client’s interaction with theserver. Some of these have already been mentioned in earlier sections:

n The server character set and collation are available as the values of thecharacter_set_server and collation_server variables.

n The character set and collation of the default database are available as the values of thecharacter_set_database and collation_database variables.

Additional character set and collation variables are involved in handling traffic for the con-nection between a client and the server. Every client has connection-related character setand collation variables.

Consider what a “connection” is: It’s what you make when you connect to the server. Theclient sends SQL statements, such as queries, over the connection to the server. The serversends responses, such as result sets, over the connection back to the client. This leads to sev-eral questions about character set and collation handling for client connections, each ofwhich can be answered in terms of system variables:

n What character set is the query in when it leaves the client?

The server takes the character_set_client variable to be the character set in whichqueries are sent by the client.

n What character set should the server translate a query to after receiving it?

For this, character_set_connection and collation_connection are used by the server. It converts queries sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8).collation_connection is important for comparisons of literal strings. For comparisonsof strings with column values, it does not matter because columns have a higher colla-tion precedence.

04 6337 ch03 6/24/04 12:44 PM Page 88

Page 9: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

893.3 Determining the Default Character Set and Collation

n What character set should the server translate to before shipping result sets or errormessages back to the client?

The character_set_results variable indicates the character set in which the serverreturns query results to the client. This includes result data such as column values, andresult metadata such as column names.

You can fine-tune the settings for these variables, or you can depend on the defaults (inwhich case, you can skip this section).

There are two statements that affect the connection character sets:

SET NAMES ‘charset_name’

SET CHARACTER SET charset_name

SET NAMES indicates what is in the SQL statements that the client sends. Thus, SET NAMES‘cp1251’ tells the server “future incoming messages from this client will be in character setcp1251.” It also specifies the character set for results that the server sends back to the client.(For example, it indicates what character set column values will have if you use a SELECTstatement.)

A SET NAMES ‘x’ statement is equivalent to these three statements:

mysql> SET character_set_client = x;

mysql> SET character_set_results = x;

mysql> SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the default collation for x.

SET CHARACTER SET is similar but sets the connection character set and collation to be thoseof the default database. A SET CHARACTER SET x statement is equivalent to these three state-ments:

mysql> SET character_set_client = x;

mysql> SET character_set_results = x;

mysql> SET collation_connection = @@collation_database;

When a client connects, it sends to the server the name of the character set that it wants touse. The server sets the character_set_client, character_set_results, andcharacter_set_connection variables to that character set. (In effect, the server performs aSET NAMES operation using the character set.)

With the mysql client, it is not necessary to execute SET NAMES every time you start up if youwant to use a character set different from the default. You can add the --default-character-set option setting to your mysql statement line, or in your option file. For example, the fol-lowing option file setting changes the three character set variables set to koi8r each timeyou run mysql:

[mysql]

default-character-set=koi8r

04 6337 ch03 6/24/04 12:44 PM Page 89

Page 10: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

90 CHAPTER 3 Character Set Support

Example: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do notsay SET NAMES or SET CHARACTER SET, then for SELECT column1 FROM t, the server will sendback all the values for column1 using the character set that the client specified when it con-nected. On the other hand, if you say SET NAMES ‘latin1’ or SET CHARACTER SET latin1,then just before sending results back, the server will convert the latin2 values to latin1.Conversion may be lossy if there are characters that are not in both character sets.

If you do not want the server to perform any conversion, set character_set_results to NULL:

mysql> SET character_set_results = NULL;

3.3.7 Character String Literal Character Set and CollationEvery character string literal has a character set and a collation, which may not be null.

A character string literal may have an optional character set introducer and COLLATE clause:

[_charset_name]’string’ [COLLATE collation_name]

Examples:

SELECT ‘string’;

SELECT _latin1’string’;

SELECT _latin1’string’ COLLATE latin1_danish_ci;

For the simple statement SELECT ‘string’, the string has the character set and collationdefined by the character_set_connection and collation_connection system variables.

The _charset_name expression is formally called an introducer. It tells the parser, “the stringthat is about to follow is in character set X.” Because this has confused people in the past, weemphasize that an introducer does not cause any conversion, it is strictly a signal that doesnot change the string’s value. An introducer is also legal before standard hex literal andnumeric hex literal notation (x’literal’ and 0xnnnn), and before ? (parameter substitutionwhen using prepared statements within a programming language interface).

Examples:

SELECT _latin1 x’AABBCC’;

SELECT _latin1 0xAABBCC;

SELECT _latin1 ?;

MySQL determines a literal’s character set and collation thus:

n If both _X and COLLATE Y were specified, then character set X and collation Yn If _X is specified but COLLATE is not specified, then character set X and its default

collationn Otherwise, the character set and collation given by the character_set_connection and

collation_connection system variables

04 6337 ch03 6/24/04 12:44 PM Page 90

Page 11: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

913.3 Determining the Default Character Set and Collation

Examples:

n A string with latin1 character set and latin1_german1_ci collation:SELECT _latin1’Müller’ COLLATE latin1_german1_ci;

n A string with latin1 character set and its default collation (that is, latin1_swedish_ci):SELECT _latin1’Müller’;

n A string with the connection default character set and collation:

SELECT ‘Müller’;

Character set introducers and the COLLATE clause are implemented according to standardSQL specifications.

3.3.8 Using COLLATE in SQL StatementsWith the COLLATE clause, you can override whatever the default collation is for a comparison.COLLATE may be used in various parts of SQL statements. Here are some examples:

n With ORDER BY:SELECT k

FROM t1

ORDER BY k COLLATE latin1_german2_ci;

n With AS:SELECT k COLLATE latin1_german2_ci AS k1

FROM t1

ORDER BY k1;

n With GROUP BY:SELECT k

FROM t1

GROUP BY k COLLATE latin1_german2_ci;

n With aggregate functions:SELECT MAX(k COLLATE latin1_german2_ci)

FROM t1;

n With DISTINCT:SELECT DISTINCT k COLLATE latin1_german2_ci

FROM t1;

n With WHERE:SELECT *

FROM t1

WHERE _latin1 ‘Müller’ COLLATE latin1_german2_ci = k;

04 6337 ch03 6/24/04 12:44 PM Page 91

Page 12: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

92 CHAPTER 3 Character Set Support

n With HAVING:SELECT k

FROM t1

GROUP BY k

HAVING k = _latin1 ‘Müller’ COLLATE latin1_german2_ci;

3.3.9 COLLATE Clause PrecedenceThe COLLATE clause has high precedence (higher than ||), so the following two expressionsare equivalent:

x || y COLLATE z

x || (y COLLATE z)

3.3.10 BINARY OperatorThe BINARY operator is a shorthand for a COLLATE clause. BINARY ‘x’ is equivalent to ‘x’ COLLATE

y, where y is the name of the binary collation for the character set of ‘x’. Every character set hasa binary collation. For example, the binary collation for the latin1 character set is latin1_bin, soif the column a is of character set latin1, the following two statements have the same effect:

SELECT * FROM t1 ORDER BY BINARY a;

SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;

3.3.11 Some Special Cases Where the CollationDetermination Is TrickyIn the great majority of queries, it is obvious what collation MySQL uses to resolve a com-parison operation. For example, in the following cases, it should be clear that the collationwill be “the column collation of column x”:

SELECT x FROM T ORDER BY x;

SELECT x FROM T WHERE x = x;

SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

SELECT x FROM T WHERE x = ‘Y’;

Should this query use the collation of the column x, or of the string literal ‘Y’?

Standard SQL resolves such questions using what used to be called “coercibility” rules. Theessence is: Because x and ‘Y’ both have collations, whose collation takes precedence? It’scomplex, but the following rules take care of most situations:

n An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)n A concatenation of two strings with different collations has a coercibility of 1.

04 6337 ch03 6/24/04 12:44 PM Page 92

Page 13: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

933.3 Determining the Default Character Set and Collation

n A column’s collation has a coercibility of 2.n A literal’s collation has a coercibility of 3.

Those rules resolve ambiguities thus:

n Use the collation with the lowest coercibility value.n If both sides have the same coercibility, then it is an error if the collations aren’t the

same.

Examples:

column1 = ‘A’ Use collation of column1

column1 = ‘A’ COLLATE x Use collation of ‘A’

column1 COLLATE x = ‘A’ COLLATE y Error

The COERCIBILITY() function can be used to determine the coercibility of a string expression:

mysql> SELECT COERCIBILITY(‘A’ COLLATE latin1_swedish_ci);

-> 0

mysql> SELECT COERCIBILITY(‘A’);

-> 3

See Section 5.8.3, “Information Functions.”

3.3.12 Collations Must Be for the Right Character SetRecall that each character set has one or more collations, and each collation is associatedwith one and only one character set. Therefore, the following statement causes an errormessage because the latin2_bin collation is not legal with the latin1 character set:

mysql> SELECT _latin1 ‘x’ COLLATE latin2_bin;

ERROR 1251: COLLATION ‘latin2_bin’ is not valid

for CHARACTER SET ‘latin1’

In some cases, expressions that worked before MySQL 4.1 fail as of MySQL 4.1 if you donot take character set and collation into account. For example, before 4.1, this statementworks as is:

mysql> SELECT SUBSTRING_INDEX(USER(),’@’,1);

+-------------------------------+

| SUBSTRING_INDEX(USER(),’@’,1) |

+-------------------------------+

| root |

+-------------------------------+

04 6337 ch03 6/24/04 12:44 PM Page 93

Page 14: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

94 CHAPTER 3 Character Set Support

After an upgrade to MySQL 4.1, the statement fails:

mysql> SELECT SUBSTRING_INDEX(USER(),’@’,1);

ERROR 1267 (HY000): Illegal mix of collations

(utf8_general_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE)

for operation ‘substr_index’

The reason this occurs is that usernames are stored using UTF8 (see Section 3.6, “UTF8for Metadata”). As a result, the USER() function and the literal string ‘@’ have different char-acter sets (and thus different collations):

mysql> SELECT COLLATION(USER()), COLLATION(‘@’);

+-------------------+-------------------+

| COLLATION(USER()) | COLLATION(‘@’) |

+-------------------+-------------------+

| utf8_general_ci | latin1_swedish_ci |

+-------------------+-------------------+

One way to deal with this is to tell MySQL to interpret the literal string as utf8:

mysql> SELECT SUBSTRING_INDEX(USER(),_utf8’@’,1);

+------------------------------------+

| SUBSTRING_INDEX(USER(),_utf8’@’,1) |

+------------------------------------+

| root |

+------------------------------------+

Another way is to change the connection character set and collation to utf8. You can do thatwith SET NAMES ‘utf8’ or by setting the character_set_connection and collation_connectionsystem variables directly.

3.3.13 An Example of the Effect of CollationSuppose that column X in table T has these latin1 column values:

Muffler

Müller

MX Systems

MySQL

And suppose that the column values are retrieved using the following statement:

SELECT X FROM T ORDER BY X COLLATE collation_name;

04 6337 ch03 6/24/04 12:44 PM Page 94

Page 15: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

953.4 Operations Affected by Character Set Support

The resulting order of the values for different collations is shown in this table:

latin1_swedish_ci latin1_german1_ci latin1_german2_ci

Muffler Muffler Müller

MX Systems Müller Muffler

Müller MX Systems MX Systems

MySQL MySQL MySQL

The table is an example that shows what the effect would be if we used different collationsin an ORDER BY clause. The character that causes the different sort orders in this example isthe U with two dots over it, which the Germans call U-umlaut, but we’ll call it U-dieresis.

n The first column shows the result of the SELECT using the Swedish/Finnish collatingrule, which says that U-dieresis sorts with Y.

n The second column shows the result of the SELECT using the German DIN-1 rule,which says that U-dieresis sorts with U.

n The third column shows the result of the SELECT using the German DIN-2 rule, whichsays that U-dieresis sorts with UE.

Three different collations, three different results. That’s what MySQL is here to handle. Byusing the appropriate collation, you can choose the sort order you want.

3.4 Operations Affected by Character SetSupportThis section describes operations that take character set information into account as ofMySQL 4.1.

3.4.1 Result StringsMySQL has many operators and functions that return a string. This section answers thequestion: What is the character set and collation of such a string?

For simple functions that take string input and return a string result as output, the output’scharacter set and collation are the same as those of the principal input value. For example,UPPER(X) returns a string whose character string and collation are the same as that of X. Thesame applies for INSTR(), LCASE(), LOWER(), LTRIM(), MID(), REPEAT(), REPLACE(), REVERSE(),RIGHT(), RPAD(), RTRIM(), SOUNDEX(), SUBSTRING(), TRIM(), UCASE(), and UPPER(). (Also note:The REPLACE() function, unlike all other functions, ignores the collation of the string inputand performs a case-insensitive comparison every time.)

04 6337 ch03 6/24/04 12:44 PM Page 95

Page 16: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

96 CHAPTER 3 Character Set Support

For operations that combine multiple string inputs and return a single string output, the“aggregation rules” of standard SQL apply:

n If an explicit COLLATE X occurs, then use Xn If an explicit COLLATE X and COLLATE Y occur, then errorn Otherwise, if all collations are X, then use Xn Otherwise, the result has no collation

For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X END, the resultantcollation is X. The same applies for CASE, UNION, ||, CONCAT(), ELT(), GREATEST(), IF(), andLEAST().

For operations that convert to character data, the character set and collation of the stringsthat result from the operations are defined by the character_set_connection and collation_connection system variables. This applies for CAST(), CHAR(), CONV(), FORMAT(), HEX(), andSPACE().

3.4.2 CONVERT()CONVERT() provides a way to convert data between different character sets. The syntax is:

CONVERT(expr USING transcoding_name)

In MySQL, transcoding names are the same as the corresponding character set names.

Examples:

SELECT CONVERT(_latin1’Müller’ USING utf8);

INSERT INTO utf8table (utf8column)

SELECT CONVERT(latin1field USING utf8) FROM latin1table;

CONVERT(... USING ...) is implemented according to the standard SQL specification.

3.4.3 CAST()You may also use CAST() to convert a string to a different character set. The syntax is:

CAST(character_string AS character_data_type CHARACTER SET charset_name)

Example:

SELECT CAST(_latin1’test’ AS CHAR CHARACTER SET utf8);

If you use CAST() without specifying CHARACTER SET, the resulting character set and collationare defined by the character_set_connection and collation_connection system variables. Ifyou use CAST() with CHARACTER SET X, then the resulting character set and collation are Xand the default collation of X.

04 6337 ch03 6/24/04 12:44 PM Page 96

Page 17: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

973.4 Operations Affected by Character Set Support

You may not use a COLLATE clause inside a CAST(), but you may use it outside. That is,CAST(... COLLATE ...) is illegal, but CAST(...) COLLATE ... is legal.

Example:

SELECT CAST(_latin1’test’ AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;

3.4.4 SHOW StatementsSeveral SHOW statements are new or modified in MySQL 4.1 to provide additional characterset information. SHOW CHARACTER SET, SHOW COLLATION, and SHOW CREATE DATABASE are new.SHOW CREATE TABLE and SHOW COLUMNS are modified.

The SHOW CHARACTER SET command shows all available character sets. It takes an optionalLIKE clause that indicates which character set names to match. For example:

mysql> SHOW CHARACTER SET LIKE ‘latin%’;

+---------+-----------------------------+-------------------+--------+

| Charset | Description | Default collation | Maxlen |

+---------+-----------------------------+-------------------+--------+

| latin1 | ISO 8859-1 West European | latin1_swedish_ci | 1 |

| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |

| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |

| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |

+---------+-----------------------------+-------------------+--------+

See Section 6.5.3.2, “SHOW CHARACTER SET Syntax.”

The output from SHOW COLLATION includes all available character sets. It takes an optionalLIKE clause that indicates which collation names to match. For example:

mysql> SHOW COLLATION LIKE ‘latin1%’;

+-------------------+---------+----+---------+----------+---------+

| Collation | Charset | Id | Default | Compiled | Sortlen |

+-------------------+---------+----+---------+----------+---------+

| latin1_german1_ci | latin1 | 5 | | | 0 |

| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 |

| latin1_danish_ci | latin1 | 15 | | | 0 |

| latin1_german2_ci | latin1 | 31 | | Yes | 2 |

| latin1_bin | latin1 | 47 | | Yes | 0 |

| latin1_general_ci | latin1 | 48 | | | 0 |

| latin1_general_cs | latin1 | 49 | | | 0 |

| latin1_spanish_ci | latin1 | 94 | | | 0 |

+-------------------+---------+----+---------+----------+---------+

04 6337 ch03 6/24/04 12:44 PM Page 97

Page 18: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

98 CHAPTER 3 Character Set Support

See Section 6.5.3.3, “SHOW COLLATION Syntax.”

SHOW CREATE DATABASE displays the CREATE DATABASE statement that will create a given data-base. The result includes all database options. DEFAULT CHARACTER SET and COLLATE are sup-ported. All database options are stored in a text file named db.opt that can be found in thedatabase directory.

mysql> SHOW CREATE DATABASE a\G

*************************** 1. row ***************************

Database: a

Create Database: CREATE DATABASE `a`

/*!40100 DEFAULT CHARACTER SET macce */

See Section 6.5.3.5, “SHOW CREATE DATABASE Syntax.”

SHOW CREATE TABLE is similar, but displays the CREATE TABLE statement to create a giventable. The column definitions now indicate any character set specifications, and the tableoptions include character set information.

See Section 6.5.3.6, “SHOW CREATE TABLE Syntax.”

The SHOW COLUMNS statement displays the collations of a table’s columns when invoked asSHOW FULL COLUMNS. Columns with CHAR, VARCHAR, or TEXT data types have non-NULL colla-tions. Numeric and other non-character types have NULL collations. For example:

mysql> SHOW FULL COLUMNS FROM t;

+-------+---------+------------+------+-----+---------+-------+

| Field | Type | Collation | Null | Key | Default | Extra |

+-------+---------+------------+------+-----+---------+-------+

| a | char(1) | latin1_bin | YES | | NULL | |

| b | int(11) | NULL | YES | | NULL | |

+-------+---------+------------+------+-----+---------+-------+

The character set is not part of the display. (The character set name is implied by the colla-tion name.)

See Section 6.5.3.4, “SHOW COLUMNS Syntax.”

3.5 Unicode SupportAs of MySQL version 4.1, there are two new character sets for storing Unicode data:

n ucs2, the UCS-2 Unicode character set.n utf8, the UTF8 encoding of the Unicode character set.

04 6337 ch03 6/24/04 12:44 PM Page 98

Page 19: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

993.6 UTF8 for Metadata

In UCS-2 (binary Unicode representation), every character is represented by a two-byteUnicode code with the most significant byte first. For example: “LATIN CAPITALLETTER A” has the code 0x0041 and it’s stored as a two-byte sequence: 0x00 0x41.“CYRILLIC SMALL LETTER YERU” (Unicode 0x044B) is stored as a two-bytesequence: 0x04 0x4B. For Unicode characters and their codes, please refer to theUnicode Home Page (http://www.unicode.org/).

A temporary restriction is that UCS-2 cannot yet be used as a client character set. Thatmeans that SET NAMES ‘ucs2’ will not work.

The UTF8 character set (transform Unicode representation) is an alternative way to storeUnicode data. It is implemented according to RFC2279. The idea of the UTF8 characterset is that various Unicode characters fit into byte sequences of different lengths:

n Basic Latin letters, digits, and punctuation signs use one byte.n Most European and Middle East script letters fit into a two-byte sequence: extended

Latin letters (with tilde, macron, acute, grave, and other accents), Cyrillic, Greek,Armenian, Hebrew, Arabic, Syriac, and others.

n Korean, Chinese, and Japanese ideographs use three-byte sequences.

Currently, MySQL UTF8 support does not include four-byte sequences.

Tip: To save space with UTF8, use VARCHAR instead of CHAR. Otherwise, MySQL has toreserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column, because that’s the maximumpossible length.

3.6 UTF8 for MetadataThe metadata is the data about the data. Anything that describes the database, as opposed tobeing the contents of the database, is metadata. Thus column names, database names, user-names, version names, and most of the string results from SHOW are metadata.

Representation of metadata must satisfy these requirements:

n All metadata must be in the same character set. Otherwise, SHOW wouldn’t work properlybecause different rows in the same column would be in different character sets.

n Metadata must include all characters in all languages. Otherwise, users wouldn’t be ableto name columns and tables in their own languages.

In order to satisfy both requirements, MySQL stores metadata in a Unicode character set,namely UTF8. This will not cause any disruption if you never use accented characters. Butif you do, you should be aware that metadata is in UTF8.

This means that the USER(), CURRENT_USER(), and VERSION() functions will have the UTF8character set by default. So will any synonyms, such as the SESSION_USER() and SYSTEM_USER()synonyms for USER().

04 6337 ch03 6/24/04 12:44 PM Page 99

Page 20: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

100 CHAPTER 3 Character Set Support

The server sets the character_set_system system variable to the name of the metadata char-acter set:

mysql> SHOW VARIABLES LIKE ‘character_set_system’;

+----------------------+-------+

| Variable_name | Value |

+----------------------+-------+

| character_set_system | utf8 |

+----------------------+-------+

Storage of metadata using Unicode does not mean that the headers of columns and theresults of DESCRIBE functions will be in the character_set_system character set by default.When you say SELECT column1 FROM t, the name column1 itself will be returned from theserver to the client in the character set as determined by the SET NAMES statement. Morespecifically, the character set used is determined by the value of the character_set_resultssystem variable. If this variable is set to NULL, no conversion is performed and the serverreturns metadata using its original character set (the set indicated by character_set_system).

If you want the server to pass metadata results back in a non-UTF8 character set, then useSET NAMES to force the server to perform character set conversion (see Section 3.3.6,“Connection Character Sets and Collations”), or else set the client to do the conversion. Itis always more efficient to set the client to do the conversion, but this option will not beavailable for many clients until late in the MySQL 4.x product cycle.

If you are just using, for example, the USER() function for comparison or assignment withina single statement, don’t worry. MySQL will do some automatic conversion for you.

SELECT * FROM Table1 WHERE USER() = latin1_column;

This will work because the contents of latin1_column are automatically converted to UTF8before the comparison.

INSERT INTO Table1 (latin1_column) SELECT USER();

This will work because the contents of USER() are automatically converted to latin1 beforethe assignment. Automatic conversion is not fully implemented yet, but should work cor-rectly in a later version.

Although automatic conversion is not in the SQL standard, the SQL standard documentdoes say that every character set is (in terms of supported characters) a “subset” ofUnicode. Since it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.

04 6337 ch03 6/24/04 12:44 PM Page 100

Page 21: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

1013.10 Upgrading from MySQL 4.0

3.7 Compatibility with Other DBMSsFor MaxDB compatibility these two statements are the same:

CREATE TABLE t1 (f1 CHAR(n) UNICODE);

CREATE TABLE t1 (f1 CHAR(n) CHARACTER SET ucs2);

3.8 New Character Set Configuration File FormatIn MySQL 4.1, character set configuration is stored in XML files, one file per character set.In previous versions, this information was stored in .conf files.

3.9 National Character SetBefore MySQL 4.1, NCHAR and CHAR were synonymous. ANSI defines NCHAR or NATIONALCHAR as a way to indicate that a CHAR column should use some predefined character set.MySQL 4.1 and up uses utf8 as that predefined character set. For example, these columntype declarations are equivalent:

CHAR(10) CHARACTER SET utf8

NATIONAL CHARACTER(10)

NCHAR(10)

As are these:

VARCHAR(10) CHARACTER SET utf8

NATIONAL VARCHAR(10)

NCHAR VARCHAR(10)

NATIONAL CHARACTER VARYING(10)

NATIONAL CHAR VARYING(10)

You can use N’literal’ to create a string in the national character set. These two statementsare equivalent:

SELECT N’some text’;

SELECT _utf8’some text’;

3.10 Upgrading Character Sets from MySQL 4.0Now, what about upgrading from older versions of MySQL? MySQL 4.1 is almost upwardcompatible with MySQL 4.0 and earlier for the simple reason that almost all the features arenew, so there’s nothing in earlier versions to conflict with. However, there are some differ-ences and a few things to be aware of.

04 6337 ch03 6/24/04 12:44 PM Page 101

Page 22: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

102 CHAPTER 3 Character Set Support

Most important: The “MySQL 4.0 character set” has the properties of both “MySQL 4.1character sets” and “MySQL 4.1 collations.” You will have to unlearn this. Henceforth, wewill not bundle character set/collation properties in the same conglomerate object.

There is a special treatment of national character sets in MySQL 4.1. NCHAR is not the sameas CHAR, and N’...’ literals are not the same as ‘...’ literals.

Finally, there is a different file format for storing information about character sets and colla-tions. Make sure that you have reinstalled the /share/mysql/charsets/ directory containingthe new configuration files.

If you want to start mysqld from a 4.1.x distribution with data created by MySQL 4.0, youshould start the server with the same character set and collation. In this case you won’t needto reindex your data.

There are two ways to do so:

shell> ./configure --with-charset=... --with-collation=...

shell> ./mysqld --default-character-set=... --default-collation=...

If you used mysqld with, for example, the MySQL 4.0 danish character set, you should nowuse the latin1 character set and the latin1_danish_ci collation:

shell> ./configure --with-charset=latin1 \

--with-collation=latin1_danish_ci

shell> ./mysqld --default-character-set=latin1 \

--default-collation=latin1_danish_ci

Use the table shown in Section 3.10.1, “4.0 Character Sets and Corresponding 4.1Character Set/Collation Pairs,” to find old 4.0 character set names and their 4.1 characterset/collation pair equivalents.

If you have non-latin1 data stored in a 4.0 latin1 table and want to convert the table column definitions to reflect the actual character set of the data, use the instructions inSection 3.10.2, “Converting 4.0 Character Columns to 4.1 Format.”

3.10.1 4.0 Character Sets and Corresponding 4.1 CharacterSet/Collation Pairs

ID 4.0 Character Set 4.1 Character Set 4.1 Collation

1 big5 big5 big5_chinese_ci

2 czech latin2 latin2_czech_ci

3 dec8 dec8 dec8_swedish_ci

4 dos cp850 cp850_general_ci

5 german1 latin1 latin1_german1_ci

6 hp8 hp8 hp8_english_ci

7 koi8_ru koi8r koi8r_general_ci

04 6337 ch03 6/24/04 12:44 PM Page 102

Page 23: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

1033.10 Upgrading from MySQL 4.0

ID 4.0 Character Set 4.1 Character Set 4.1 Collation

8 latin1 latin1 latin1_swedish_ci

9 latin2 latin2 latin2_general_ci

10 swe7 swe7 swe7_swedish_ci

11 usa7 ascii ascii_general_ci

12 ujis ujis ujis_japanese_ci

13 sjis sjis sjis_japanese_ci

14 cp1251 cp1251 cp1251_bulgarian_ci

15 danish latin1 latin1_danish_ci

16 hebrew hebrew hebrew_general_ci

17 win1251 (removed) (removed)

18 tis620 tis620 tis620_thai_ci

19 euc_kr euckr euckr_korean_ci

20 estonia latin7 latin7_estonian_ci

21 hungarian latin2 latin2_hungarian_ci

22 koi8_ukr koi8u koi8u_ukrainian_ci

23 win1251ukr cp1251 cp1251_ukrainian_ci

24 gb2312 gb2312 gb2312_chinese_ci

25 greek greek greek_general_ci

26 win1250 cp1250 cp1250_general_ci

27 croat latin2 latin2_croatian_ci

28 gbk gbk gbk_chinese_ci

29 cp1257 cp1257 cp1257_lithuanian_ci

30 latin5 latin5 latin5_turkish_ci

31 latin1_de latin1 latin1_german2_ci

3.10.2 Converting 4.0 Character Columns to 4.1 FormatNormally, the server runs using the latin1 character set by default. If you have been storingcolumn data that actually is in some other character set that the 4.1 server now supportsdirectly, you can convert the column. However, you should avoid trying to convert directlyfrom latin1 to the “real” character set. This may result in data loss. Instead, convert the col-umn to a binary column type, and then from the binary type to a non-binary type with thedesired character set. Conversion to and from binary involves no attempt at character valueconversion and preserves your data intact. For example, suppose that you have a 4.0 tablewith three columns that are used to store values represented in latin1, latin2, and utf8:

CREATE TABLE t

(

latin1_col CHAR(50),

latin2_col CHAR(100),

utf8_col CHAR(150)

);

04 6337 ch03 6/24/04 12:44 PM Page 103

Page 24: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

104 CHAPTER 3 Character Set Support

After upgrading to MySQL 4.1, you want to convert this table to leave latin1_col alone butchange the latin2_col and utf8_col columns to have character sets of latin2 and utf8.First, back up your table, then convert the columns as follows:

ALTER TABLE t MODIFY latin2_col BINARY(100);

ALTER TABLE t MODIFY utf8_col BINARY(150);

ALTER TABLE t MODIFY latin2_col CHAR(100) CHARACTER SET latin2;

ALTER TABLE t MODIFY utf8_col CHAR(150) CHARACTER SET utf8;

The first two statements “remove” the character set information from the latin2_col andutf8_col columns. The second two statements assign the proper character sets to the twocolumns.

If you like, you can combine the to-binary conversions and from-binary conversions intosingle statements:

ALTER TABLE t

MODIFY latin2_col BINARY(100),

MODIFY utf8_col BINARY(150);

ALTER TABLE t

MODIFY latin2_col CHAR(100) CHARACTER SET latin2,

MODIFY utf8_col CHAR(150) CHARACTER SET utf8;

3.11 Character Sets and Collations That MySQLSupportsHere is an annotated list of character sets and collations that MySQL supports. Becauseoptions and installation settings differ, some sites might not have all items listed, and somesites might have items not listed.

MySQL supports 70+ collations for 30+ character sets. The character sets and their defaultcollations are displayed by the SHOW CHARACTER SET statement. (The output actually includesanother column that is not shown so that the example fits better on the page.)

mysql> SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+

| Charset | Description | Default collation |

+----------+-----------------------------+---------------------+

| big5 | Big5 Traditional Chinese | big5_chinese_ci |

| dec8 | DEC West European | dec8_swedish_ci |

| cp850 | DOS West European | cp850_general_ci |

| hp8 | HP West European | hp8_english_ci |

| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |

| latin1 | ISO 8859-1 West European | latin1_swedish_ci |

| latin2 | ISO 8859-2 Central European | latin2_general_ci |

| swe7 | 7bit Swedish | swe7_swedish_ci |

| ascii | US ASCII | ascii_general_ci |

04 6337 ch03 6/24/04 12:44 PM Page 104

Page 25: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

1053.11 Character Sets and Collations That MySQL Supports

| ujis | EUC-JP Japanese | ujis_japanese_ci |

| sjis | Shift-JIS Japanese | sjis_japanese_ci |

| cp1251 | Windows Cyrillic | cp1251_bulgarian_ci |

| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |

| tis620 | TIS620 Thai | tis620_thai_ci |

| euckr | EUC-KR Korean | euckr_korean_ci |

| koi8u | KOI8-U Ukrainian | koi8u_general_ci |

| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |

| greek | ISO 8859-7 Greek | greek_general_ci |

| cp1250 | Windows Central European | cp1250_general_ci |

| gbk | GBK Simplified Chinese | gbk_chinese_ci |

| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |

| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |

| utf8 | UTF-8 Unicode | utf8_general_ci |

| ucs2 | UCS-2 Unicode | ucs2_general_ci |

| cp866 | DOS Russian | cp866_general_ci |

| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |

| macce | Mac Central European | macce_general_ci |

| macroman | Mac West European | macroman_general_ci |

| cp852 | DOS Central European | cp852_general_ci |

| latin7 | ISO 8859-13 Baltic | latin7_general_ci |

| cp1256 | Windows Arabic | cp1256_general_ci |

| cp1257 | Windows Baltic | cp1257_general_ci |

| binary | Binary pseudo charset | binary |

| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |

+----------+-----------------------------+---------------------+

3.11.1 Unicode Character SetsMySQL has two Unicode character sets. You can store texts in about 650 languages usingthese character sets. We have not added a large number of collations for these two new setsyet, but that will be happening soon. Currently, they have default case-insensitive accent-insensitive collations, plus the binary collation.

Currently, the ucs2_general_uca collation has only partial support for the Unicode CollationAlgorithm. Some characters are not supported yet.

n ucs2 (UCS-2 Unicode) collations:n ucs2_bin

n ucs2_general_ci (default)n ucs2_general_uca

n utf8 (UTF-8 Unicode) collations:n utf8_bin

n utf8_general_ci (default)

04 6337 ch03 6/24/04 12:44 PM Page 105

Page 26: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

106 CHAPTER 3 Character Set Support

3.11.2 West European Character SetsWest European Character Sets cover most West European languages, such as French,Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish,Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

n ascii (US ASCII) collations:n ascii_bin

n ascii_general_ci (default)n cp850 (DOS West European) collations:

n cp850_bin

n cp850_general_ci (default)n dec8 (DEC West European) collations:

n dec8_bin

n dec8_swedish_ci (default)n hp8 (HP West European) collations:

n hp8_bin

n hp8_english_ci (default)n latin1 (ISO 8859-1 West European) collations:

n latin1_bin

n latin1_danish_ci

n latin1_general_ci

n latin1_general_cs

n latin1_german1_ci

n latin1_german2_ci

n latin1_spanish_ci

n latin1_swedish_ci (default)

The latin1 is the default character set. The latin1_swedish_ci collation is the defaultthat probably is used by the majority of MySQL customers. It is constantly stated thatthis is based on the Swedish/Finnish collation rules, but you will find Swedes and Finnswho disagree with that statement.

The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 andDIN-2 standards, where DIN stands for Deutsches Institut für Normung (that is, theGerman answer to ANSI). DIN-1 is called the dictionary collation and DIN-2 is calledthe phone-book collation.

n latin1_german1_ci (dictionary) rules:

‘Ä’ = ‘A’, ‘Ö’ = ‘O’, ‘Ü’ = ‘U’, ‘ß’ = ‘s’

04 6337 ch03 6/24/04 12:44 PM Page 106

Page 27: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

1073.11 Character Sets and Collations That MySQL Supports

n latin1_german2_ci (phone-book) rules:

‘Ä’ = ‘AE’, ‘Ö’ = ‘OE’, ‘Ü’ = ‘UE’, ‘ß’ = ‘ss’

In the latin1_spanish_ci collation, ‘Ñ’ (N-tilde) is a separate letter between ‘N’ and ‘O’.

n macroman (Mac West European) collations:n macroman_bin

n macroman_general_ci (default)n swe7 (7-bit Swedish) collations:

n swe7_bin

n swe7_swedish_ci (default)

3.11.3 Central European Character SetsWe have some support for character sets used in the Czech Republic, Slovakia, Hungary,Romania, Slovenia, Croatia, and Poland.

n cp1250 (Windows Central European) collations:n cp1250_bin

n cp1250_czech_ci

n cp1250_general_ci (default)n cp852 (DOS Central European) collations:

n cp852_bin

n cp852_general_ci (default)n keybcs2 (DOS Kamenicky Czech-Slovak) collations:

n keybcs2_bin

n keybcs2_general_ci (default)n latin2 (ISO 8859-2 Central European) collations:

n latin2_bin

n latin2_croatian_ci

n latin2_czech_ci

n latin2_general_ci (default)n latin2_hungarian_ci

n macce (Mac Central European) collations:n macce_bin

n macce_general_ci (default)

04 6337 ch03 6/24/04 12:44 PM Page 107

Page 28: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

108 CHAPTER 3 Character Set Support

3.11.4 South European and Middle East Character Setsn armscii8 (ARMSCII-8 Armenian) collations:

n armscii8_bin

n armscii8_general_ci (default)n cp1256 (Windows Arabic) collations:

n cp1256_bin

n cp1256_general_ci (default)n geostd8 (GEOSTD8 Georgian) collations:

n geostd8_bin

n geostd8_general_ci (default)n greek (ISO 8859-7 Greek) collations:

n greek_bin

n greek_general_ci (default)n hebrew (ISO 8859-8 Hebrew) collations:

n hebrew_bin

n hebrew_general_ci (default)n latin5 (ISO 8859-9 Turkish) collations:

n latin5_bin

n latin5_turkish_ci (default)

3.11.5 Baltic Character SetsThe Baltic character sets cover Estonian, Latvian, and Lithuanian languages. There are twoBaltic character sets currently supported:

n cp1257 (Windows Baltic) collations:n cp1257_bin

n cp1257_general_ci (default)n cp1257_lithuanian_ci

n latin7 (ISO 8859-13 Baltic) collations:n latin7_bin

n latin7_estonian_cs

n latin7_general_ci (default)n latin7_general_cs

04 6337 ch03 6/24/04 12:44 PM Page 108

Page 29: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

1093.11 Character Sets and Collations That MySQL Supports

3.11.6 Cyrillic Character SetsHere are the Cyrillic character sets and collations for use with Belarusian, Bulgarian,Russian, and Ukrainian languages.

n cp1251 (Windows Cyrillic) collations:n cp1251_bin

n cp1251_bulgarian_ci

n cp1251_general_ci (default)n cp1251_general_cs

n cp1251_ukrainian_ci

n cp866 (DOS Russian) collations:n cp866_bin

n cp866_general_ci (default)n koi8r (KOI8-R Relcom Russian) collations:

n koi8r_bin

n koi8r_general_ci (default)n koi8u (KOI8-U Ukrainian) collations:

n koi8u_bin

n koi8u_general_ci (default)

3.11.7 Asian Character SetsThe Asian character sets that we support include Chinese, Japanese, Korean, and Thai.These can be complicated. For example, the Chinese sets must allow for thousands of differ-ent characters.

n big5 (Big5 Traditional Chinese) collations:n big5_bin

n big5_chinese_ci (default)n euckr (EUC-KR Korean) collations:

n euckr_bin

n euckr_korean_ci (default)n gb2312 (GB2312 Simplified Chinese) collations:

n gb2312_bin

n gb2312_chinese_ci (default)

04 6337 ch03 6/24/04 12:44 PM Page 109

Page 30: Character Set Support - Pearson Educationcharacter_set_databaseand collation_databasesystem variables. The server sets these variables whenever the default database changes. If there

110 CHAPTER 3 Character Set Support

n gbk (GBK Simplified Chinese) collations:n gbk_bin

n gbk_chinese_ci (default)n sjis (Shift-JIS Japanese) collations:

n sjis_bin

n sjis_japanese_ci (default)n tis620 (TIS620 Thai) collations:

n tis620_bin

n tis620_thai_ci (default)n ujis (EUC-JP Japanese) collations:

n ujis_bin

n ujis_japanese_ci (default)

04 6337 ch03 6/24/04 12:44 PM Page 110