Pinyin tone marks, mysql & accented characters
Most of the mysql UTF8 collations such as utf8_general_ci are case insensitive which also means that for sorting and comparisons, accented characters are treated as being the same even though their internal representation is different and they are returned by queries or displayed in the original form.
For the wirdz Chinese-English Dictionary this turns out to be quite convenient. pinyin (which is a "Romanized" method of representing Chinese characters and pronunciation) has two variants: tone marks and tone numbers.
For example, the Chinese phrase 發表演講 which means to give a speech may be represented in pinyin as either fā biǎo yǎn jiǎng (tone marks) or fa1 biao3 yan3 jiang3 (tone numbers).
For dictionary look ups which simply need to locate a term (and are not directly concerned with pronunciation), mysql's case insensitivity allows unaccented look ups directly while matching directly against the accented version held on the database without needing to have an additional column for use in searches and as well the need to pre-process search terms to remove any accents. Since search terms can be entered by cut-and-paste, both accented and unaccented entries need to be covered.
So far so good, but there can also be the need to separate out the different accented versions. While mysql does have case insensitive collations such as utf8_bin which can be applied at column level, this all gets a bit messy and could require the same data in two separate columns with different collations. Collations in some circumstances can also be set within queries, But this is also a tad messy.
If there are requirements to distinguish between accented characters, the hex function is a fairly simple option.
So, for example, if c is an accented column in table T with a character insensitive collation, and there are are only two values of c, say "a" and "ā", then select distinct s from T will return only one row (which will contain only one of the two variants) whereas select distinct hex(s), s from T will return two rows and both values.
For the wirdz Chinese-English Dictionary this turns out to be quite convenient. pinyin (which is a "Romanized" method of representing Chinese characters and pronunciation) has two variants: tone marks and tone numbers.
For example, the Chinese phrase 發表演講 which means to give a speech may be represented in pinyin as either fā biǎo yǎn jiǎng (tone marks) or fa1 biao3 yan3 jiang3 (tone numbers).
For dictionary look ups which simply need to locate a term (and are not directly concerned with pronunciation), mysql's case insensitivity allows unaccented look ups directly while matching directly against the accented version held on the database without needing to have an additional column for use in searches and as well the need to pre-process search terms to remove any accents. Since search terms can be entered by cut-and-paste, both accented and unaccented entries need to be covered.
So far so good, but there can also be the need to separate out the different accented versions. While mysql does have case insensitive collations such as utf8_bin which can be applied at column level, this all gets a bit messy and could require the same data in two separate columns with different collations. Collations in some circumstances can also be set within queries, But this is also a tad messy.
If there are requirements to distinguish between accented characters, the hex function is a fairly simple option.
So, for example, if c is an accented column in table T with a character insensitive collation, and there are are only two values of c, say "a" and "ā", then select distinct s from T will return only one row (which will contain only one of the two variants) whereas select distinct hex(s), s from T will return two rows and both values.
0 Comments:
Post a Comment
<< Home