Friday, May 08, 2020

Pinyin tone marks, mysql & accented characters

Most of the mysql UTF8 collations, such as utf8_general_ci, are case insensitive, and they are accent insensitive too: for sorting and comparisons, accented characters are treated as equal to their unaccented counterparts, even though their internal representations differ and queries return and display them in their original form.

For the wirdz Chinese-English Dictionary this turns out to be quite convenient.  Pinyin (a "romanized" method of representing Chinese characters and their pronunciation) has two variants: tone marks and tone numbers.

For example, the Chinese phrase 發表演講, which means to give a speech, may be represented in pinyin as either fā biǎo yǎn jiǎng (tone marks) or fa1 biao3 yan3 jiang3 (tone numbers).

For dictionary look-ups which simply need to locate a term (and are not directly concerned with pronunciation), mysql's accent insensitivity allows an unaccented search term to match directly against the accented version held in the database, without needing an additional column just for searching, or any pre-processing of search terms to strip accents.  Since search terms can be entered by cut-and-paste, both accented and unaccented input needs to be covered.
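
As a minimal sketch of such a look-up in the PHP back-end (the table and column names here are hypothetical, not the actual wirdz schema), a plain equality match is all that is needed, because the accent-insensitive collation does the matching work:

    <?php
    // Connect with the utf8 charset so multi-byte pinyin round-trips intact.
    $db = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'user', 'pass');

    // Under an accent-insensitive collation such as utf8_general_ci on the
    // pinyin column, the unaccented term "fa biao yan jiang" matches the
    // stored accented form "fā biǎo yǎn jiǎng" directly.
    $stmt = $db->prepare('SELECT headword, definition FROM entries WHERE pinyin = ?');
    $stmt->execute(['fa biao yan jiang']);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);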

So far so good, but there can also be a need to separate out the different accented versions.  While mysql does have accent-sensitive binary collations such as utf8_bin which can be applied at column level, this all gets a bit messy and could require the same data in two separate columns with different collations.  Collations can also, in some circumstances, be set within queries, but this is also a tad messy.
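
For reference, a per-query collation override might look like the following (same hypothetical schema as before); the explicit COLLATE clause makes this one comparison accent sensitive:

    <?php
    $db = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'user', 'pass');

    // Forcing the binary collation for this one comparison: the accented
    // term now matches only itself, not its unaccented counterpart.
    $stmt = $db->prepare('SELECT headword FROM entries WHERE pinyin COLLATE utf8_bin = ?');
    $stmt->execute(['fā biǎo yǎn jiǎng']);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);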

If there is a requirement to distinguish between accented characters, mysql's hex() function is a fairly simple option.

So, for example, if c is an accented column in table T with an accent-insensitive collation, and there are only two values of c, say "a" and "ā", then select distinct c from T will return only one row (containing just one of the two variants), whereas select distinct hex(c), c from T will return two rows and both values, because the hex representations of the two variants differ.
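
The same trick driven from PHP, using the table T and column c of the example above, might look like:

    <?php
    $db = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'user', 'pass');

    // hex(c) exposes the byte-level difference that the accent-insensitive
    // collation hides, so "a" and "ā" come back as two separate rows.
    foreach ($db->query('SELECT DISTINCT HEX(c) AS h, c FROM T') as $row) {
        echo $row['c'], ' = 0x', $row['h'], "\n";
    }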

Thursday, May 07, 2020

There are a number of interesting old dictionaries which have been digitised in a form that can be relatively easily processed and uploaded to populate the mysql database table structure used by the wirdz™ dictionary engine.

The latest addition to wirdz.com is the Sailor's Word Book, a 19th century dictionary of nautical terms created by one admiral and edited by another. 

As always, the text was never intended to be easy for automated text processing, which in this case came more than 100 years after the book was produced.  The trick is to be able to identify where the main body of the dictionary starts, where it ends, when a new term starts, whether there is a part of speech included and where the definition starts.  The Victorian typesetters and the folks who undertake the digitisation are in general both well-disciplined, with strong attention to detail and consistency, and so, despite there always being a few edge cases, the digitisation, which for wirdz.com content uses variations on a core tool built in Python, can be relatively straightforward; a rough sketch of the kind of line classification involved follows.
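
As an illustration only (the real tool is the Python one mentioned above, and the entry format assumed here is purely hypothetical), classifying a line as the start of a new term, with an optional part of speech, or as a continuation of the current definition can be little more than a regular expression:

    <?php
    // Hypothetical entry format: "TERM, n. Definition text..." where the
    // term is set in capitals and the part of speech is optional.
    function classify_line($line) {
        if (preg_match('/^([A-Z][A-Z\'\- ]+)[.,]\s*(?:(adj|adv|n|v)\.\s*)?(.*)$/', $line, $m)) {
            return ['term' => trim($m[1]),
                    'pos'  => $m[2] !== '' ? $m[2] : null,
                    'def'  => $m[3]];
        }
        return null;  // not a new entry: continuation of the previous definition
    }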

Wednesday, May 06, 2020

Chinese characters and escapes with AJAX and PHP

Although most of the time UTF-8, combined with PHP's multi-byte settings, seems to do the trick beneath the hood, there is always a corner case.  Supporting Chinese character searches against the English-Chinese Dictionary, and Kanji/Kana searches against the Japanese-English Dictionary, means passing Chinese and Japanese characters via the AJAX call to the PHP back-end process.  When this was tested, the escaped characters passed to the server process did not prove easy to convert back to UTF-8 using any of the standard PHP functions.

For example, the Chinese character string 介質訪問控制層 arrived as:

%u4ECB%u8CEA%u8A2A%u554F%u63A7%u5236%u5C64

when processed through the AJAX "get" method.  This %uXXXX form is the one produced by JavaScript's legacy escape() function; it is not standard URL percent-encoding, which is why the usual suspects such as urldecode cannot handle it.

After eliminating all of the standard functions, the following steps finally did the trick (a sketch in PHP follows the list):

1. Replacing % with \ in the input string to get it into the standard \uXXXX escape form that JSON uses.

2. Converting the string to a string representing a single-element JSON array, i.e. ["my utf8 string"] (double quotes, which JSON requires).

3. Using the json_decode function on the string.

4. Extracting the converted text string from the returned array.
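
Putting those four steps together in PHP (the function and variable names are just for illustration):

    <?php
    function decode_percent_u($raw) {
        // 1. %uXXXX -> \uXXXX, the escape form JSON understands.
        $escaped = str_replace('%', '\\', $raw);
        // 2. Wrap it as a string representing a single-element JSON array.
        $json = '["' . $escaped . '"]';
        // 3. Let json_decode do the conversion to UTF-8.
        $decoded = json_decode($json);
        // 4. Extract the converted text from the returned array.
        return $decoded[0];
    }

    echo decode_percent_u('%u4ECB%u8CEA%u8A2A%u554F%u63A7%u5236%u5C64');
    // prints 介質訪問控制層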

Not the most obvious way to handle the problem, but simple in the end!  Because the string passed to the server by the AJAX call is subject to character filtering, the risk of a stray character in the input being misinterpreted by json_decode is minimized.