Convert Unicode from JSON string with PHP

Question

I've been reading up on a few solutions but have not managed to get anything to work as yet.

I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.

I'd like to use PHP to convert these into either £ or &pound;.

I looked into the problem and found the following code (using my pound symbol to test), but it didn't seem to work:

$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');

The output is Â£.

Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?

UPDATE

It seems that the JSON string from the API represents each of these characters as 2 or 3 separately escaped Unicode sequences, e.g.:

That\u00e2\u0080\u0099s (right single quotation mark)
\u00c2\u00a3 (pound symbol)

It sounds like the encoding is broken at the other end of the API. Â£ is what you typically get if you take UTF-8 encoded data and read it as ISO-8859-1. I guess that is happening somewhere in the API provider's system before the resulting string is then JSON encoded. A bit of a mess, really. The first port of call should be to notify the API provider and ask them to fix it.
– SDC, Commented Jan 25, 2013 at 17:29
Thanks SDC. I dropped them an email to say just that. Hopefully it will be updated soon, but perhaps that is wishful thinking!
– Alexander Holsgrove, Commented Jan 25, 2013 at 22:45

SirDarius · Accepted Answer · 2013-01-28 13:44:07Z

11

It is not UTF-16. The data looks wrongly encoded, because the \uXXXX escape syntax is independent of any UTF or UCS encoding of Unicode: each escape names a code point directly. \u00c2\u00a3 really does map to the string Â£.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the 2-byte UTF-8 encoding of this code point.

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of escaped code points back into a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.
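To make that hypothesis concrete, here is a small sketch that reproduces the symptom. It assumes PHP 7 or later, and escapeEachByte is a hypothetical stand-in for whatever the API provider's encoder actually does, not their real code.

```php
<?php
// Hypothetical stand-in for the suspected faulty encoder: it escapes every
// byte of the UTF-8 input as if each byte were a separate code point.
function escapeEachByte(string $utf8): string
{
    $out = '';
    for ($i = 0, $n = strlen($utf8); $i < $n; $i++) {
        $out .= sprintf('\u%04x', ord($utf8[$i]));
    }
    return $out;
}

$pound = "\u{00A3}";                   // U+00A3 POUND SIGN, stored as UTF-8

echo bin2hex($pound), PHP_EOL;         // c2a3 (the two UTF-8 bytes)
echo json_encode($pound), PHP_EOL;     // "\u00a3" (what a correct encoder emits)
echo escapeEachByte($pound), PHP_EOL;  // \u00c2\u00a3 (what the feed contains)

// Decoding the faulty form yields two characters, U+00C2 and U+00A3.
$decoded = json_decode('"' . escapeEachByte($pound) . '"');
echo bin2hex($decoded), PHP_EOL;       // c382c2a3, i.e. "Â£" in UTF-8
```

If the feed matches this pattern, every character outside ASCII will show up as 2 to 4 consecutive \u00XX escapes, one per UTF-8 byte.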

function fixBadUnicode($str) {
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}

Example here: http://phpfiddle.org/main/code/6sq-rkn

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}

Edit 2: fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.
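For newer PHP versions, where the /e modifier is no longer available, the same idea could be sketched with preg_replace_callback while keeping everything in UTF-8, which also sidesteps the '?' problem. This is only a sketch under assumptions: PHP 7 or later, and repairByteEscapes is a made-up name. It only touches escapes for bytes 0x80 to 0xFF, and only when the resulting run of bytes is valid UTF-8, so a legitimately escaped character such as \u00e9 is left alone.

```php
<?php
// Hypothetical helper: undo byte-by-byte \u00XX escaping in raw JSON text.
// Only escapes for bytes 0x80-0xFF are matched, so ASCII escapes such as
// \u0022 (a double quote) stay escaped and the JSON remains well-formed.
function repairByteEscapes(string $json): string
{
    return preg_replace_callback(
        '/(?:\\\\u00[89a-fA-F][0-9a-fA-F])+/',
        function (array $m): string {
            // "\u00c2\u00a3" -> "c2a3" -> the two raw bytes 0xC2 0xA3
            $bytes = hex2bin(str_replace('\u00', '', $m[0]));
            // Keep the replacement only if the bytes form valid UTF-8;
            // otherwise return the original escapes untouched.
            return preg_match('//u', $bytes) === 1 ? $bytes : $m[0];
        },
        $json
    );
}

$raw  = '{"title":"That\u00e2\u0080\u0099s \u00c2\u00a35, caf\u00e9"}';
$data = json_decode(repairByteEscapes($raw), true);
echo $data['title'], PHP_EOL;  // That’s £5, café
```

It is a heuristic rather than a complete fix: a correctly escaped character that sits directly next to a broken sequence makes the whole run invalid, so that run is left as it was.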

edited Jan 28, 2013 at 13:44

answered Jan 25, 2013 at 14:43


9 Comments

Alexander Holsgrove Over a year ago

Thanks for this. Can I run that on the entire string before or after calling json_decode, to save calling 'fixBadUnicode' multiple times?

SirDarius Over a year ago

You can run it before json_decode, but be careful: this might leave your JSON string containing illegal characters. See json.org for the list of characters that can exist in JSON strings.

Alexander Holsgrove Over a year ago

If I run it on the raw JSON, it converts the '\u00c2\u00a3' to '�'. I also found \u0099 is left unchanged - I think this is an apostrophe. Seems like a really poor JSON data feed!

Alexander Holsgrove Over a year ago

That's great - thank you. I don't need the encoded JSON after it has been 'fixed' as I need to iterate through the data. Can I instead call json_decode and then preg_replace(...) without needing to call json_encode and the substr?

Hossein J Over a year ago

preg_replace "e" is deprecated, can you write this in the format of "preg_replace_callback" ?


Yo-han · Answer · 2013-01-25 14:41:19Z

3

The output is correct.

\u00c2 == Â
\u00a3 == £

So nothing is wrong here. And converting to HTML entities is easy:

htmlentities($title);
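One thing worth checking here is the charset argument, since htmlentities interprets the input bytes according to it. The sketch below assumes PHP 7 or later and passes the charset explicitly. The variable names are only illustrative.

```php
<?php
$pound    = "\u{00A3}";          // correctly decoded: one character, U+00A3
$mojibake = "\u{00C2}\u{00A3}";  // what json_decode returns for \u00c2\u00a3

$flags = ENT_QUOTES | ENT_SUBSTITUTE;

// A correct UTF-8 string with a matching charset gives the expected entity.
echo htmlentities($pound, $flags, 'UTF-8'), PHP_EOL;       // &pound;

// The wrongly decoded string is faithfully converted, stray Â included.
echo htmlentities($mojibake, $flags, 'UTF-8'), PHP_EOL;    // &Acirc;&pound;

// A charset mismatch produces the same symptom from a correct string.
echo htmlentities($pound, $flags, 'ISO-8859-1'), PHP_EOL;  // &Acirc;&pound;
```

So htmlentities only produces &pound; on its own once the string has been repaired to a single U+00A3 character.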

answered Jan 25, 2013 at 14:41


3 Comments

Alexander Holsgrove Over a year ago

The first part is correct, but htmlentities($title) gives me Ã�Â£

SirDarius Over a year ago

The output is correct, but it is obvious that the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point.

Alexander Holsgrove Over a year ago

Just for reference, the JSON is from the Hot UK Deals API. I didn't want to mess about with the default XML feed type

Yann Rimbaud · Answer · 2018-08-03 16:56:13Z

Here is an updated version of the function using preg_replace_callback instead of preg_replace.

function fixBadUnicodeForJson($str) {
    // Replace runs of 4, then 3, 2 and 1 wrongly escaped bytes (\u00XX)
    // with the raw bytes they stand for, longest run first.
    // Inside the callback the captured groups are read from $matches;
    // "$1"-style references only worked with the old /e modifier.
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function ($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function ($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function ($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function ($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
