Friday, March 17, 2006

MySQL string splitter

The latest dictionary to be added to the wirdz/AMCD site is an English-Japanese dictionary based on the EDICT Japanese-English dictionary. This dictionary is made available in the form of a delimiter separated text file with a primary delimiter character of "/".

Because the original file is Japanese to English, it may have several English equivalents, separated as above, for a single Japanese terms (up to 11).

To restructure the data as English-Japanese sorted alphabetically, it was necessary to create separate entries corresponding to each English term. This was done on the database side (much faster once the initial data load to the remote host server was completed - this took some time as the original dictionary has >100000 entries).

MySQL doesn't have a direct equivalent to a "split" function/method for strings but it does have the substring_index function which can be used to achieve the same effect as shown below.

Using:
select replace(substring(substring_index('a,b,c', ',', 1),
                         length(substring_index('a,b,c', 
                                                ',', 
                                                1 - 1)) + 1),
               ',', '') ITEM1,
       replace(substring(substring_index('a,b,c', ',', 2),
                         length(substring_index('a,b,c', 
                                                ',', 
                                                2 - 1)) + 1),
               ',', '') ITEM2,
       replace(substring(substring_index('a,b,c', ',', 3),
                         length(substring_index('a,b,c', 
                                                ',', 
                                                3 - 1)) + 1),
               ',', '') ITEM3,
       replace(substring(substring_index('a,b,c', ',', 4),
                        length(substring_index('a,b,c', 
                        ',', 4 - 1)) + 1),
               ',', '') ITEM4

the returned values are: 'a' 'b' 'c' and an empty string as required.

The above formula doesn't need the first item to be handled as a special case and returns empty strings correctly when the item count is less than the position requested.

Monday, March 13, 2006

PHP Unicode processing

One of the next dictionaries to be added to the the wirdz dictionary engine will be in Chinese. One of the objectives is to provide full text searching against pinyin terms excluding the tone marks (see previous blog entry for more on pinyin).

For example, where the pinyin is jiésuàn to allow the search to match against jiesuan; i.e. to "detone" the pinyin.

Because the PHP needs to work on servers where the PHP multi-byte functions are not available, the code needs to process the Unicode directly.

The following function provide a generic mapping capablity which will replace one Unicode character directly with the replacement define in a mapping array when the mapping is UTF-8. (Since PHP provides a function to convert from one Unicode encoding to another, it is not necessary to handle any other encodings.)
function unicodeToOrdinal ($c) {
  $nOrd = 0;
  if (ord($c{0})>=0 && ord($c{0})<=127)  
    $nOrd = $c{0};
  if (ord($c{0})>=192 && ord($c{0})<=223) 
    $nOrd = (ord($c{0})-192)*64 + (ord($c{1})-128);
  if (ord($c{0})>=224 && ord($c{0})<=239) 
    $nOrd = (ord($c{0})-224)*4096 
          + (ord($c{1})-128)*64 
          + (ord($c{2})-128);
  if (ord($c{0})>=240 && ord($c{0})<=247) 
    $nOrd = (ord($c{0})-240)*262144 
          + (ord($c{1})-128)*4096 
          + (ord($c{2})-128)*64 
          + (ord($c{3})-128);
  if (ord($c{0})>=248 && ord($c{0})<=251) 
    $nOrd = (ord($c{0})-248)*16777216 
          + (ord($c{1})-128)*262144 
          + (ord($c{2})-128)*4096 
          + (ord($c{3})-128)*64 
          + (ord($c{4})-128);
  if (ord($c{0})>=252 && ord($c{0})<=253) 
    $nOrd = (ord($c{0})-252)*1073741824 
          + (ord($c{1})-128)*16777216 
          + (ord($c{2})-128)*262144 
          + (ord($c{3})-128)*4096 
          + (ord($c{4})-128)*64 
          + (ord($c{5})-128);
  if (ord($c{0})>=254 && ord($c{0})<=255) 
    $nOrd = false; //error
  return $nOrd;
}

function mapUTF8Chars ($strIn, $aMappingTable) {
  $strOut = "";
  $iPos = 0;
  $nLen = strlen ($strIn);  
  while ($iPos < $nLen) {
    $bAscii = true;
    $nAscii = ord (substr ($strIn, $iPos, 1));
    if (($nAscii >= 240) && ($nAsci <= 255)) {
      // 4 chars representing one unicode character
      $strChar = substr ($strIn, $iPos, 4);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 4;
    }
    else if (($nAscii >= 224) && ($nAscii <= 239)) {
      // 3 chars representing one unicode character
      $strChar = substr ($source, $iPos, 3);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 3;
    }
    else if (($nAscii >= 192) && ($nAscii <= 223)) {
      // 2 chars representing one unicode character
      $strChar = substr ($strIn, $iPos, 2);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 2;
    }
    else {
      // 1 char (lower ascii)
      $strChar = substr ($strIn, $iPos, 1);
      $nOrd = ord($strChar);
      $iPos += 1; 
    }
    if ($bAscii) {
      $strOut .= $strChar;
    }
    else {  
      $strMappedChar = $aMappingTable[$nOrd];
      if ($strMappedChar != "") {
        $strOut .= $strMappedChar;
      }
      else {
        $strOut .= $strChar;
      }   
    }
  }
  return $strOut;
}

The mapping table format is as follows (based on a subset of the "detoning" table):
$aPinyinMapping = array (
  "97"  => "a",
  "257" => "a",
  "225" => "a",
  "462" => "a",
  "224" => "a");

In addition to the specific operation implemented, the function also provides the basic logic/approach needed to process UTF-8 encoded Unicode.

Tuesday, March 07, 2006

Unicode URL Escapes in PHP

(The wirdz online dictionary can be found at wirdz.com.)

The standard method in for getting the value of a URL parameter is to use the $_GET array which is automatically created.

Unfortunately, the process which generates the array assumes that the URL only contains ASCII and escaped ASCII values. i.e. it doesn't handle Unicode escapes properly. So you need to get the value directly from the query string.

The following function returns a query string value (the alternative is to create an array first - more efficient for a long query string with more parameters):



function getQueryParameter ($strParam) {
  $strQueryString = $_SERVER['QUERY_STRING'];
  $aParamList = explode('&',$strQueryString);
  $i = 0;
  while ($i < count($aParamList)) {
    $aParam = split('=', $aParamList[$i]);
    if ($strParam == $aParam[0]) {
      return $aParam[1];
    } 
    $i++; 
  }
  return "";
}


The following function pair can be used then the convert the parameter value to Unicode (note in order for the Unicode values to display correctly on a generated HMTL page - the meta command <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> needs to be included in the page.):



function code2utf($num){
  if($num<128 br="">    return chr($num);
  if($num<2048 br="">    return chr(($num>>6)+192)
          .chr(($num&63)+128);
  if($num<65536 br="">    return chr(($num>>12)+224)
          .chr((($num>>6)&63)+128)
          .chr(($num&63)+128);
  if($num<2097152 br="">    return chr(($num>>18)+240)
          .chr((($num>>12)&63)+128)
          .chr((($num>>6)&63)+128)
          .chr(($num&63)+128);
  return '';
}

function unescape($strIn, $iconv_to = 'UTF-8') {
  $strOut = '';
  $iPos = 0;
  $len = strlen ($strIn);
  while ($iPos < $len) {
    $charAt = substr ($strIn, $iPos, 1);
    if ($charAt == '%') {
      $iPos++;
      $charAt = substr ($strIn, $iPos, 1);
      if ($charAt == 'u') {
        // Unicode character
        $iPos++;
        $unicodeHexVal = substr ($strIn, $iPos, 4);
        $unicode = hexdec ($unicodeHexVal);
        $strOut .= code2utf($unicode);
        $iPos += 4;
      }
      else {
        // Escaped ascii character
        $hexVal = substr ($strIn, $iPos, 2);
        if (hexdec($hexVal) > 127) {
          // Convert to Unicode 
          $strOut .= code2utf(hexdec ($hexVal));
        }
        else {
          $strOut .= chr (hexdec ($hexVal));
        }
        $iPos += 2;
      }
    }
    else {
      $strOut .= $charAt;
      $iPos++;
    }
  }
  if ($iconv_to != "UTF-8") {
    $strOut = iconv("UTF-8", $iconv_to, $strOut);
  }   
  return $strOut;
}


This function was adapted from previously published code. It will handle either Unicode or ASCII escapes correctly.

Escape and Unescape - JavaScript

(The wirdz online dictionary can be found at wirdz.com.)

With the wirdz dictionary engine, as you change dictionaries, the term you are looking for (or the initial letters of the term) are transferred through to the next dictionary via the URLusing a bit of JavaScript intervention. All well and good if the term is in plain ASCII. But what about Unicode charsets?

JavaScript provide 2 functions "escape" and "unescape" to convert from Unicode to URL safe escape codes and back. Here's an example based on "pinyin" which is used to represent Chinese in a roman font with "tone" marks to express the pronunciation:

Pinyin:

yīngguó diànzǐ jiésuàn xìtǒng


Escaped pinyin:

y%u012Bnggu%F3%20di%E0nz%u01D0%20ji
%E9su%E0n%20x%ECt%u01D2ng


The escape function uses Unicode escapes (%u + a four digit hex value) for Unicode characters not included in the ASCII range 128-255 but used standard ASCII escapes for characters in the range 128-255. No doubt for good backwards compatibility reasons.

The dictionary engine needs to convert back to Unicode on the server. (More on the way in which PHP handles Unicode escapes in URLs in a later entry).

It is cleaner to ensure that all escaped characters are Unicode encoded. To achieve this the following JavasScript function can be used:


function unicodeEscape (pstrString) {
  if (pstrString == "") {
    return "";
  }
  var iPos = 0;
  var strOut = "";
  var strChar;
  var strString = escape(pstrString);
  while (iPos < strString.length) {
    strChar = strString.substr(iPos, 1);
    if (strChar == "%") {
      strNextChar = strString.substr(iPos + 1, 1);
      if (strNextChar == "u") {
        strOut += strString.substr(iPos, 6);
        iPos += 6; 
      }
      else {
        strOut += "%u00" + 
                  strString.substr(iPos + 1, 2);
        iPos += 3;
      }
    }
    else {
      strOut += strChar;
      iPos++;
    }
  }
  return strOut;
}


Using this, the above pinyin becomes:

UTF8 escaped pinyin:

y%u012Bnggu%u00F3%u0020di%u00E0nz%u01D0
%u0020ji%u00E9su%u00E0n%u0020x%u00ECt%u01D2ng

Beginings ...

The objective is to create the web's best and easiest to use online dictionary engine.

The dictionary can be found at wirdz.com. As time goes by, additional dictionary content will be added.

The challenges:

(a) Mastering the the various little wrinkles necessary to get a multi-language, multi-character set dictionary engine to do all that we want it to do. The chosen technology platform is: MySQL, PHP, Javascript and Ajax (more of this later) with full Unicode support for presentation, searching etc.

(b) Mastering the search engine maze to ensure high placement. It's definitely no longer a case of simply "if you build it, they will come". It might have worked for Kevin Costner but let's face it, we are not building a baseball park in Idaho whatever you might have thought.

And why Atamyrat ... well that's another story.

So let's get blogging ...