Monday, March 13, 2006

PHP Unicode processing

One of the next dictionaries to be added to the wirdz dictionary engine will be Chinese. One of the objectives is to provide full-text searching against pinyin terms with the tone marks ignored (see the previous blog entry for more on pinyin).

For example, where the pinyin is jiésuàn, a search for jiesuan should still match; i.e. the pinyin needs to be "detoned".

Because the PHP code needs to run on servers where the PHP multi-byte (mbstring) functions are not available, it has to process the UTF-8 bytes directly.
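As a rough illustration of that constraint, a script can check at run time whether the multi-byte functions are present before deciding which code path to use. This is only a sketch: the variable name is made up for the example, and it assumes the multi-byte functions in question are the ones from the mbstring extension.

```php
<?php
// extension_loaded() and function_exists() are standard PHP functions.
// $bHasMbString is a hypothetical name used only in this sketch.
$bHasMbString = extension_loaded('mbstring') && function_exists('mb_strlen');

if ($bHasMbString) {
  echo "mbstring is available\n";
} else {
  echo "mbstring is missing: fall back to byte-level UTF-8 handling\n";
}
```

The code in the rest of this post takes the byte-level route unconditionally, so it behaves the same on both kinds of server.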

The following functions provide a generic mapping capability: each Unicode character in the input string is replaced directly with the replacement defined for it in a mapping array. The input must be UTF-8 encoded. (Since PHP provides a function to convert from one encoding to another, it is not necessary to handle any other encodings.)
function unicodeToOrdinal ($c) {
  // Returns the code point of the UTF-8 sequence at the start of $c.
  // $c[0] is used instead of $c{0}: the curly-brace form was removed in PHP 8.
  $b0 = ord($c[0]); // the lead byte determines the sequence length
  $nOrd = 0;
  if ($b0 <= 127)                 // 1 byte (ASCII)
    $nOrd = $b0;
  if ($b0 >= 192 && $b0 <= 223)   // 2 bytes
    $nOrd = ($b0-192)*64 + (ord($c[1])-128);
  if ($b0 >= 224 && $b0 <= 239)   // 3 bytes
    $nOrd = ($b0-224)*4096 
          + (ord($c[1])-128)*64 
          + (ord($c[2])-128);
  if ($b0 >= 240 && $b0 <= 247)   // 4 bytes
    $nOrd = ($b0-240)*262144 
          + (ord($c[1])-128)*4096 
          + (ord($c[2])-128)*64 
          + (ord($c[3])-128);
  // The 5- and 6-byte forms below belong to the original UTF-8 definition;
  // RFC 3629 no longer allows them.
  if ($b0 >= 248 && $b0 <= 251) 
    $nOrd = ($b0-248)*16777216 
          + (ord($c[1])-128)*262144 
          + (ord($c[2])-128)*4096 
          + (ord($c[3])-128)*64 
          + (ord($c[4])-128);
  if ($b0 >= 252 && $b0 <= 253) 
    $nOrd = ($b0-252)*1073741824 
          + (ord($c[1])-128)*16777216 
          + (ord($c[2])-128)*262144 
          + (ord($c[3])-128)*4096 
          + (ord($c[4])-128)*64 
          + (ord($c[5])-128);
  if ($b0 >= 254)                 // 254 and 255 never occur in UTF-8
    $nOrd = false; //error
  return $nOrd;
}
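To make the arithmetic concrete: é is stored as the two bytes 195 and 169, so its code point is (195-192)*64 + (169-128) = 233. The sketch below does the same decoding with bit masks and shifts instead of subtraction and multiplication. The function name utf8ToCodePoint is made up for this example, and the sketch only handles the 1- to 4-byte forms.

```php
<?php
// Hypothetical compact decoder (not the function from this post): returns the
// code point of the first UTF-8 sequence in $c, or false if the lead byte is
// invalid or the sequence is cut short.
function utf8ToCodePoint($c) {
  $b0 = ord($c[0]);
  if ($b0 < 0x80) {
    return $b0;                                   // 1 byte (ASCII)
  } elseif ($b0 >= 0xC0 && $b0 <= 0xDF) {
    $nExtra = 1; $nOrd = $b0 & 0x1F;              // 2-byte sequence
  } elseif ($b0 >= 0xE0 && $b0 <= 0xEF) {
    $nExtra = 2; $nOrd = $b0 & 0x0F;              // 3-byte sequence
  } elseif ($b0 >= 0xF0 && $b0 <= 0xF7) {
    $nExtra = 3; $nOrd = $b0 & 0x07;              // 4-byte sequence
  } else {
    return false;                                 // continuation or invalid lead byte
  }
  if (strlen($c) <= $nExtra) {
    return false;                                 // truncated sequence
  }
  for ($i = 1; $i <= $nExtra; $i++) {
    // Each continuation byte contributes its low 6 bits.
    $nOrd = ($nOrd << 6) | (ord($c[$i]) & 0x3F);
  }
  return $nOrd;
}

echo utf8ToCodePoint("\xC3\xA9"), "\n";     // 233   (é)
echo utf8ToCodePoint("\xC4\x81"), "\n";     // 257   (ā)
echo utf8ToCodePoint("\xE4\xB8\xAD"), "\n"; // 20013 (中)
```

The byte escapes keep the example independent of the encoding the source file is saved in.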

function mapUTF8Chars ($strIn, $aMappingTable) {
  $strOut = "";
  $iPos = 0;
  $nLen = strlen ($strIn);  
  while ($iPos < $nLen) {
    $bAscii = true;
    $nAscii = ord (substr ($strIn, $iPos, 1));
    if (($nAscii >= 240) && ($nAscii <= 255)) {
      // 4 chars representing one unicode character
      $strChar = substr ($strIn, $iPos, 4);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 4;
    }
    else if (($nAscii >= 224) && ($nAscii <= 239)) {
      // 3 chars representing one unicode character
      $strChar = substr ($strIn, $iPos, 3);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 3;
    }
    else if (($nAscii >= 192) && ($nAscii <= 223)) {
      // 2 chars representing one unicode character
      $strChar = substr ($strIn, $iPos, 2);
      $nOrd = unicodeToOrdinal($strChar);
      $bAscii = false;
      $iPos += 2;
    }
    else {
      // 1 char (lower ascii)
      $strChar = substr ($strIn, $iPos, 1);
      $nOrd = ord($strChar);
      $iPos += 1; 
    }
    if ($bAscii) {
      $strOut .= $strChar;
    }
    else {  
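      // Characters without a table entry are copied through unchanged.
      $strMappedChar = isset($aMappingTable[$nOrd]) ? $aMappingTable[$nOrd] : "";
      if ($strMappedChar != "") {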
        $strOut .= $strMappedChar;
      }
      else {
        $strOut .= $strChar;
      }   
    }
  }
  return $strOut;
}

The mapping table format is as follows (based on a subset of the "detoning" table):
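$aPinyinMapping = array (
  "97"  => "a",  // a (plain ASCII, which mapUTF8Chars copies through without a lookup)
  "257" => "a",  // ā (U+0101)
  "225" => "a",  // á (U+00E1)
  "462" => "a",  // ǎ (U+01CE)
  "224" => "a"); // à (U+00E0)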
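For comparison, a table keyed by code point like the one above can also be converted to byte-string keys and applied with PHP's built-in strtr(). This is a different technique from mapUTF8Chars, shown only as a sketch: ordinalToUtf8 is a hypothetical helper, and the entry for 233 (é) is added here so the jiésuàn example works; it is not part of the subset above.

```php
<?php
// Hypothetical helper: encode a code point as UTF-8 bytes (1 to 4 bytes).
function ordinalToUtf8($nOrd) {
  if ($nOrd < 0x80) {
    return chr($nOrd);
  }
  if ($nOrd < 0x800) {
    return chr(0xC0 | ($nOrd >> 6))
         . chr(0x80 | ($nOrd & 0x3F));
  }
  if ($nOrd < 0x10000) {
    return chr(0xE0 | ($nOrd >> 12))
         . chr(0x80 | (($nOrd >> 6) & 0x3F))
         . chr(0x80 | ($nOrd & 0x3F));
  }
  return chr(0xF0 | ($nOrd >> 18))
       . chr(0x80 | (($nOrd >> 12) & 0x3F))
       . chr(0x80 | (($nOrd >> 6) & 0x3F))
       . chr(0x80 | ($nOrd & 0x3F));
}

// Code-point-keyed table; 233 (é) is an assumption added for this example.
$aPinyinMapping = array(257 => "a", 225 => "a", 462 => "a", 224 => "a", 233 => "e");

// Re-key the table by the UTF-8 bytes of each character.
$aByteMap = array();
foreach ($aPinyinMapping as $nOrd => $strReplacement) {
  $aByteMap[ordinalToUtf8($nOrd)] = $strReplacement;
}

$strPinyin = "ji\xC3\xA9su\xC3\xA0n";   // "jiésuàn" as UTF-8 bytes
$strDetoned = strtr($strPinyin, $aByteMap);
echo $strDetoned, "\n";                 // jiesuan
```

Byte-level replacement is safe on valid UTF-8 input because a lead byte can never appear in the middle of another character's sequence, so matches only occur on character boundaries.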

In addition to the specific operation implemented, the function also provides the basic logic/approach needed to process UTF-8 encoded Unicode.
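As one illustration of reusing that logic, the same lead-byte test can split a UTF-8 string into characters, which gives a character count without the multi-byte functions. The function names below are made up for this sketch, and it assumes well-formed UTF-8 input of at most 4 bytes per character.

```php
<?php
// Hypothetical helper: number of bytes in the sequence that starts with $strByte.
function utf8SequenceLength($strByte) {
  $n = ord($strByte);
  if ($n >= 240) return 4;
  if ($n >= 224) return 3;
  if ($n >= 192) return 2;
  return 1; // ASCII (a stray continuation byte is also treated as one byte)
}

// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8Split($strIn) {
  $aChars = array();
  $iPos = 0;
  $nLen = strlen($strIn);
  while ($iPos < $nLen) {
    $nSeq = utf8SequenceLength($strIn[$iPos]);
    $aChars[] = substr($strIn, $iPos, $nSeq);
    $iPos += $nSeq;
  }
  return $aChars;
}

$strPinyin = "ji\xC3\xA9su\xC3\xA0n";   // "jiésuàn" as UTF-8 bytes
$aChars = utf8Split($strPinyin);
echo strlen($strPinyin), "\n";          // 9 (bytes)
echo count($aChars), "\n";              // 7 (characters)
```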

1 Comments:

Blogger Arto Bendiken said...

For a somewhat simpler approach, see the widely-used unaccent() function.

12:28 AM  
