
I've been reading up on a few solutions but have not managed to get anything to work as yet.

I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.

I'd like to use PHP to convert these into either £ or &pound;.

I'm looking into the problem and found the following code (using my pound symbol to test) but it didn't seem to work:

$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');

The output is Â£.

Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?

UPDATE

It seems that the JSON string from the API contains runs of 2 or 3 escaped Unicode sequences, e.g.:

That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a3 (pound symbol)
  • It sounds like the encoding is broken at the other end of the API. Â£ is what you typically get if you take UTF-8 encoded data and read it as ISO-8859-1. I guess that is happening somewhere in the API provider's system before the resulting string is JSON encoded. A bit of a mess, really. The first port of call should be to notify the API provider and ask them to fix it. Commented Jan 25, 2013 at 17:29
  • Thanks SDC. I dropped them an email to say just that. Hopefully it will be updated soon, but perhaps that is wishful thinking! Commented Jan 25, 2013 at 22:45

3 Answers

11

It is not UTF-16. Rather, it looks like a bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode. \u00c2\u00a3 really does map to the string Â£.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points back to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

function fixBadUnicode($str) {
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}

Example here: http://phpfiddle.org/main/code/6sq-rkn
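For readers on current PHP versions, where the "e" modifier no longer exists, the same repair can be sketched on the already decoded values instead of on the raw JSON. This is only a sketch under assumptions: the helper name repairMojibake is made up for illustration, the mbstring extension is assumed to be loaded, and the sample JSON (including its "price" and "note" keys) is invented from the two sequences quoted in the question. Unlike the function above, it keeps the result in UTF-8 rather than converting it to ISO-8859-1.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the answer above).
// Reverses "UTF-8 bytes read as ISO-8859-1" mojibake in a decoded string.
function repairMojibake(string $decoded): string
{
    // Anything outside U+0000..U+00FF cannot be a mis-read byte, so leave
    // such strings (and strings that are not valid UTF-8) untouched.
    if (preg_match('/[^\x{0}-\x{FF}]/u', $decoded) !== 0) {
        return $decoded;
    }

    // Map each code point U+0000..U+00FF back to the single byte it came from.
    $bytes = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');

    // Accept the result only if those bytes form valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $decoded;
}

// Invented sample built from the two sequences quoted in the question.
$json  = '{"price":"\u00c2\u00a310","note":"That\u00e2\u0080\u0099s"}';
$data  = json_decode($json, true);
$fixed = array_map('repairMojibake', $data);

echo $fixed['price'], "\n"; // £10
echo $fixed['note'], "\n";  // That’s
```

Working on decoded values avoids producing invalid JSON, and the validity check means a string containing a genuine Latin-1 character such as é is returned unchanged rather than corrupted.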

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}

Edit 2: fixed the previous function so that it transforms any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, cannot be represented in ISO-8859-1 and will therefore appear as '?' after utf8_decode.
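The loss described here can be reproduced with a short sketch. It assumes the mbstring extension with its default substitute character, and it uses mb_convert_encoding in place of utf8_decode because utf8_decode is deprecated as of PHP 8.2. The variable names are illustrative.

```php
<?php
// "That’s" with U+2019 (right single quotation mark) written as explicit
// UTF-8 bytes, i.e. the repaired form of \u00e2\u0080\u0099.
$text = "That\xE2\x80\x99s";

// ISO-8859-1 has no slot for U+2019, so mbstring substitutes '?'
// (its default substitute character).
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n"; // That?s

// A round trip back to UTF-8 reveals whether anything was lost.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$lossy     = ($roundTrip !== $text);
var_dump($lossy); // bool(true)
```

If the page is served as UTF-8, skipping the ISO-8859-1 step altogether avoids the problem.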


9 Comments

Thanks for this. Can I run that on the entire string before or after calling json_decode, to save calling 'fixBadUnicode' multiple times?
You can run it before json_decode, but be careful: this might leave your JSON string containing illegal characters. See json.org for the list of characters that can appear in JSON strings.
If I run it on the raw JSON, it converts the '\u00c2\u00a3' to '�'. I also found \u0099 is left unchanged - I think this is an apostrophe. Seems like a really poor JSON data feed!
That's great - thank you. I don't need the encoded JSON after it has been 'fixed' as I need to iterate through the data. Can I instead call json_decode and then preg_replace(...) without needing to call json_encode and the substr?
The preg_replace "e" modifier is deprecated. Can you rewrite this using preg_replace_callback?
3

The output is correct.

\u00c2 == Â
\u00a3 == £

So nothing is wrong here. And converting to HTML entities is easy:

htmlentities($title);
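A hedged illustration of what htmlentities produces when the charset is stated explicitly: the default charset was ISO-8859-1 before PHP 5.4 and UTF-8 from 5.4 onward, so a mismatch between that default and the actual string encoding is one possible cause of unexpected output. The variable names below are illustrative.

```php
<?php
// Mojibake as decoded from "\u00c2\u00a3": U+00C2 then U+00A3, in UTF-8.
$broken = "\xC3\x82\xC2\xA3";

// The intended character, U+00A3, in UTF-8.
$repaired = "\xC2\xA3";

// Naming the charset removes any dependence on the PHP version's default.
$brokenHtml   = htmlentities($broken, ENT_QUOTES, 'UTF-8');
$repairedHtml = htmlentities($repaired, ENT_QUOTES, 'UTF-8');

echo $brokenHtml, "\n";   // &Acirc;&pound;
echo $repairedHtml, "\n"; // &pound;
```

So htmlentities faithfully converts whatever code points it is given; it only yields &pound; on its own once the string has been repaired.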

3 Comments

The first part is correct, but htmlentities($title) gives me �£
The output is correct, but it is obvious that the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point.
Just for reference, the JSON is from the Hot UK Deals API. I didn't want to mess about with the default XML feed type
3

Here is an updated version of the function using preg_replace_callback instead of preg_replace.

function fixBadUnicodeForJson($str) {
    // Inside a callback the captured groups are in $matches[1], $matches[2], ...
    // (the "$1" placeholders only work in preg_replace replacement strings).
    // Longest runs are handled first: 4-byte, 3-byte, 2-byte, then single bytes.
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
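A more conservative variant can be sketched as follows. It is not the function above but a hypothetical alternative (the name unescapeUtf8ByteRuns is invented here) that assumes the mbstring extension: it only touches runs of \u0080 to \u00ff escapes, and only when the run decodes to valid UTF-8, so ASCII escapes such as \u0022 and genuine Latin-1 escapes such as \u00e9 stay as they are and the JSON remains parseable.

```php
<?php
// Hypothetical alternative: unescape only runs of \u0080..\u00ff escapes
// that, taken as raw bytes, form valid UTF-8.
function unescapeUtf8ByteRuns(string $json): string
{
    return preg_replace_callback(
        '/(?:\\\\u00[89a-f][0-9a-f])+/i',
        function (array $m): string {
            // Drop every "\u00" prefix, leaving only the hex digits of the bytes.
            $hex   = str_ireplace('\u00', '', $m[0]);
            $bytes = hex2bin($hex);

            // Not valid UTF-8 (e.g. a lone \u00e9): keep the original escapes.
            return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $m[0];
        },
        $json
    );
}

// Invented sample: one mis-escaped pound sign and one genuine \u00e9.
$raw   = '{"title":"\u00c2\u00a35 off, caf\u00e9"}';
$fixed = unescapeUtf8ByteRuns($raw);
$data  = json_decode($fixed, true);

echo $data['title'], "\n"; // £5 off, café
```

Known limitations of this sketch: a mis-escaped sequence that sits directly next to a genuine Latin-1 escape is left alone as a whole, and an escaped backslash immediately before "u00" in the raw JSON is not accounted for.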
