Unicode URL Escapes in PHP
(The wirdz online dictionary can be found at wirdz.com.)
The standard method in PHP for getting the value of a URL parameter is to use the $_GET array, which PHP creates automatically.
Unfortunately, the process that generates the array assumes that the URL contains only ASCII characters and escaped byte values (%XX), i.e. it doesn't handle Unicode escapes of the form %uXXXX properly. So you need to get the value directly from the query string.
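To see the problem in isolation, the sketch below feeds a made-up query string through parse_str(). In a default configuration, parse_str() applies the same kind of URL decoding PHP uses when it fills $_GET. The parameter names word and price are illustrative only and do not come from the article. The %XX escape is turned into a single byte, while the %uXXXX escape comes through as literal text.

```php
<?php
// Hypothetical query string: "é" sent as a byte escape, "€" as a %u escape.
$query = 'word=caf%E9&price=%u20AC5';

// parse_str() decodes the string into an array, much as PHP does for $_GET.
parse_str($query, $params);

echo bin2hex($params['word']), "\n";  // 636166e9 ("caf" followed by the byte 0xE9)
echo $params['price'], "\n";          // %u20AC5 (the escape is left undecoded)
```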
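The following function returns a query string value (the alternative is to create an array first, which is more efficient for a long query string with many parameters):
function getQueryParameter($strParam)
{
    // Read the raw query string so that %uXXXX escapes are left untouched.
    $strQueryString = isset($_SERVER['QUERY_STRING']) ? $_SERVER['QUERY_STRING'] : '';
    $aParamList = explode('&', $strQueryString);
    foreach ($aParamList as $strPair) {
        // explode() is used here because split() was removed in PHP 7.
        $aParam = explode('=', $strPair, 2);
        if ($strParam == $aParam[0]) {
            // A parameter given without "=" has no value part.
            return isset($aParam[1]) ? $aParam[1] : '';
        }
    }
    return "";
}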
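The "array first" alternative mentioned above could look roughly like this. It is only a sketch: getQueryParameters is a hypothetical name, and the query string is passed in as an argument (rather than read from $_SERVER) so the example can run on its own. Values are returned still escaped, ready to be unescaped later, and the first occurrence of a repeated name wins, as in getQueryParameter.

```php
<?php
// Hypothetical helper (not from the original article): split a raw query
// string into name => still-escaped value pairs in a single pass.
function getQueryParameters(string $strQueryString): array
{
    $aParams = [];
    foreach (explode('&', $strQueryString) as $strPair) {
        if ($strPair === '') {
            continue; // skip empty segments such as a trailing "&"
        }
        $aParam = explode('=', $strPair, 2);
        if (!isset($aParams[$aParam[0]])) {
            // Keep the first occurrence; a missing "=" yields an empty value.
            $aParams[$aParam[0]] = $aParam[1] ?? '';
        }
    }
    return $aParams;
}

// Illustrative input; in a page this would be $_SERVER['QUERY_STRING'].
$aParams = getQueryParameters('word=%u4E2D%u6587&lang=zh&flag');
echo $aParams['word'], "\n";  // %u4E2D%u6587
echo $aParams['lang'], "\n";  // zh
```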
The following pair of functions can then be used to convert the parameter value to Unicode. (For the Unicode values to display correctly on a generated HTML page, the tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> needs to be included in the page.)
function code2utf($num)
{
    // Encode a Unicode code point as a UTF-8 byte sequence.
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

function unescape($strIn, $iconv_to = 'UTF-8')
{
    $strOut = '';
    $iPos = 0;
    $len = strlen($strIn);
    while ($iPos < $len) {
        $charAt = substr($strIn, $iPos, 1);
        if ($charAt == '%') {
            $iPos++;
            $charAt = substr($strIn, $iPos, 1);
            if ($charAt == 'u') {
                // Unicode character
                $iPos++;
                $unicodeHexVal = substr($strIn, $iPos, 4);
                $unicode = hexdec($unicodeHexVal);
                $strOut .= code2utf($unicode);
                $iPos += 4;
            } else {
                // Escaped ascii character
                $hexVal = substr($strIn, $iPos, 2);
                if (hexdec($hexVal) > 127) {
                    // Convert to Unicode
                    $strOut .= code2utf(hexdec($hexVal));
                } else {
                    $strOut .= chr(hexdec($hexVal));
                }
                $iPos += 2;
            }
        } else {
            $strOut .= $charAt;
            $iPos++;
        }
    }
    if ($iconv_to != "UTF-8") {
        $strOut = iconv("UTF-8", $iconv_to, $strOut);
    }
    return $strOut;
}
This function was adapted from previously published code. It will handle either Unicode or ASCII escapes correctly.
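As a worked example of the arithmetic in the three-byte branch of code2utf, here is the escape %u20AC (the euro sign) converted by hand. The variable names are chosen for this illustration only.

```php
<?php
// Code point taken from the escape %u20AC.
$cp = hexdec('20AC');            // 8364

// Same arithmetic as the "$num < 65536" branch of code2utf.
$b1 = ($cp >> 12) + 224;         // top 4 bits, plus the 1110xxxx lead marker
$b2 = (($cp >> 6) & 63) + 128;   // middle 6 bits, plus the 10xxxxxx marker
$b3 = ($cp & 63) + 128;          // low 6 bits, plus the 10xxxxxx marker

$euro = chr($b1) . chr($b2) . chr($b3);
echo bin2hex($euro), "\n";       // e282ac
```

E2 82 AC is the UTF-8 encoding of U+20AC, so a page served as UTF-8 displays the result as the euro sign.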
7 Comments:
Oh whoops - I'll fix it!
I've tested your code and the following modification is needed.
In code2utf, the numbers are not correct. Replace at least 1024 by 2048, as specified in the utf8_encode comments at http://il.php.net/utf8_encode.
Their version of code2utf is probably better.
function code2utf($num)
{
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
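To illustrate why the threshold matters, the sketch below encodes one code point between 1024 and 2047 with both limits. It assumes that the earlier version of the article used 1024 as the two-byte limit, which is what the comment above suggests. encodeWithLimit and its $twoByteLimit parameter are hypothetical and exist only to make the comparison possible.

```php
<?php
// Hypothetical variant of code2utf with the two-byte threshold as a parameter.
// Only the branches up to three bytes are reproduced here.
function encodeWithLimit(int $num, int $twoByteLimit): string
{
    if ($num < 128) {
        return chr($num);
    }
    if ($num < $twoByteLimit) {
        return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    }
    return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
}

$cp = 0x0440; // Cyrillic small letter er, a code point between 1024 and 2047

echo bin2hex(encodeWithLimit($cp, 2048)), "\n"; // d180   (correct two-byte form)
echo bin2hex(encodeWithLimit($cp, 1024)), "\n"; // e09180 (overlong three-byte form)
```

In valid UTF-8 a lead byte of E0 must be followed by a byte in the range A0 to BF, so the second sequence is an overlong encoding that conforming decoders reject.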
OK - I'll check your comments out and make changes as required. I didn't write code2utf myself and have not had any issues, but Unicode is a big space.
It has been a long time since my last reply, but thank you Amnon. The issue with code2utf duly turned up when passing Chinese characters via a query string, and swapping in the version you listed fixed it. The main blog article has been updated.
This works great on GET forms, but how do you access the string in POST forms?
I tried to access $GLOBALS['HTTP_RAW_POST_DATA'] but it's empty. Any idea why?
Thanks !
Marcel.
I am not sure if this is an optimal solution, but I use the following, taking advantage of the fact that PHP always does an initial decoding into ISO-8859-1. I observed that the %u escapes were left unchanged, for instance:
The POST string as seen on the HTTP dialogue is:
toto=Et%20%E9t%u0155%20%E0%201%u20AC
The PHP $_POST['toto'] is:
$_POST['toto']="Et ét%u0155 à 1%u20AC" (with é and à as single ISO-8859-1 bytes)
The following trick gives the correct result:
unescape(iconv("ISO-8859-1","UTF-8", $_POST['toto']))="Et étŕ à 1€"
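The trick can be tried outside a form handler with the sketch below. It assumes the iconv extension is available, and it uses a compact stand-in for the article's unescape/code2utf pair so that the example is self-contained. codePointToUtf8 and unescapeSketch are hypothetical names, and the stand-in only covers code points up to U+FFFF, which is all a four-digit %uXXXX escape can express. The $posted string imitates what PHP would leave in $_POST['toto'] for the request above.

```php
<?php
// Hypothetical stand-in for code2utf, limited to the Basic Multilingual Plane.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical stand-in for unescape: decodes %uXXXX and %XX escapes to UTF-8.
function unescapeSketch(string $str): string
{
    return preg_replace_callback(
        '/%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            // Group 1 is set for %uXXXX, group 2 for %XX.
            $hex = $m[1] !== '' ? $m[1] : $m[2];
            return codePointToUtf8(hexdec($hex));
        },
        $str
    );
}

// What PHP leaves behind: %E9 and %E0 are already single bytes, %u escapes are not decoded.
$posted = "Et \xE9t%u0155 \xE0 1%u20AC";

// First re-encode the ISO-8859-1 bytes as UTF-8, then expand the %u escapes.
$utf8 = unescapeSketch(iconv('ISO-8859-1', 'UTF-8', $posted));
echo $utf8, "\n"; // Et étŕ à 1€ (when viewed as UTF-8)
```

The order matters: converting to UTF-8 first means the bytes produced for the %u escapes are never passed through the ISO-8859-1 conversion.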
The only thing I am afraid of is that PHP might take header information into account and start by decoding into something other than ISO-8859-1. For instance, does it consider the charset provided in the Content-Type header:
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Does anybody know?
Marcel.
(follow-up from previous message)
For those wondering why I can't access HTTP_RAW_POST_DATA, it seems to have something to do with a config value in php.ini.
I now use
file_get_contents('php://input');
which gives me the same contents without modifying php.ini.