
I am seeing strange behavior in Perl when decoding a Unicode JSON string produced by a PHP script's json_encode function. I have reduced the problem to the following code:

#!/usr/bin/perl
use CGI;
use JSON;
print CGI::header(-type=>'text/html', -charset=>'UTF-8');

print %{ decode_json('{"test_1" : "= \u00F9 ="}') }->{'test_1'};
print '<br>';
print %{ decode_json('{"test_2" : "= \u00F9 \u0121 ="}') }->{'test_2'};

When I run this script in a browser, I see the following:

= � =
= ù ġ =

The first line contains a "broken character"; the second is correct. I think that, for some reason, Perl outputs the first string in ISO-8859-1 encoding: if I change the page encoding to ISO-8859-1, the first line is displayed correctly, but the second is then broken.

My Perl version is 5.10.1 and the JSON version is 2.51.

Question: how can I force Perl's decode_json to return UTF-8 characters in the first print?

Note: I can fix the problem by manually converting first output to UTF-8, but this requires the installation of an additional "Encoder" module, which I want to avoid.

  • The Encode module comes with Perl since v5.7.3. Commented Apr 5, 2011 at 22:30
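Since Encode ships with Perl, the manual conversion mentioned in the question needs no extra installation. The following is a minimal sketch, not the asker's code: it assumes JSON::PP (the pure-Perl implementation bundled with newer Perls, which exports the same decode_json) in place of the JSON module, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use JSON::PP;             # assumption: stands in for the JSON module used in the question
use Encode qw(encode);

# decode_json returns a Perl character string, not UTF-8 bytes.
my $text = decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};

# U+00F9 counts as one character, so the string is 5 characters long.
my $chars = length $text;

# Encoding explicitly yields the UTF-8 byte sequence (0xC3 0xB9 for U+00F9),
# which is one byte longer than the character count.
my $bytes  = encode('UTF-8', $text);
my $octets = length $bytes;

print "$chars characters, $octets bytes\n";   # 5 characters, 6 bytes
```

Printing $bytes instead of $text sends the same UTF-8 bytes to the browser regardless of any layer on STDOUT.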

1 Answer


I tried your code, and it generated several warnings once I added "use warnings;".

If you want to be sure to get UTF-8 output, I believe you have to tell Perl so. Use "binmode(STDOUT, ":utf8");" or similar.

This works on the command-line:

use strict;
use warnings;
use JSON;

binmode(STDOUT, ":utf8");

print decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};
print '<br>';
print decode_json('{"test_2" : "= \u00F9 \u0121 ="}')->{'test_2'};

EDIT: AFAIK, this does not affect decode_json() itself, only the output of the Perl script. Unicode tutorials often tell you to state explicitly which encoding you want on your input and output filehandles.
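To see that the layer on the filehandle, not decode_json(), determines the bytes that are written, here is a rough sketch. It prints the same decoded string to two in-memory filehandles, one without a layer and one with an explicit UTF-8 layer. The helper bytes_written is hypothetical, and JSON::PP is assumed as a stand-in for the JSON module.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for the JSON module used above

my $text = decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};

# Hypothetical helper: print $string to an in-memory handle opened with
# $mode and return the bytes that were written, as uppercase hex.
sub bytes_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close handle: $!";
    return uc join ' ', unpack '(H2)*', $buffer;
}

my $without_layer = bytes_written('>', $text);
my $with_layer    = bytes_written('>:encoding(UTF-8)', $text);

print "no layer:    $without_layer\n";   # 3D 20 F9 20 3D
print "UTF-8 layer: $with_layer\n";      # 3D 20 C3 B9 20 3D
```

The single byte F9 is what a browser expecting UTF-8 shows as a broken character; with the layer in place the character is written as C3 B9.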


5 Comments

However, it is strange that Perl can't output a "\u..." character as UTF-8 by default.
from the faq: "The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead."
I've read the first article, it is very nice. Just to make sure that I understood everything correctly, is my explanation below correct:
In the first string of my example, Perl sees a character that can be represented in ISO-8859-1, so it outputs it that way, and because my page encoding is UTF-8 the character looks broken. When Perl reaches the second string, it sees the character \u0121, which can't be represented in ISO-8859-1, so Perl emits a warning and outputs the whole string as UTF-8?
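The fallback described in the FAQ quote can be checked with a small sketch along these lines. It captures both the bytes written to a handle with no encoding layer and the number of warnings raised while printing. The helper raw_output is hypothetical, and an in-memory filehandle is assumed to behave like STDOUT in this respect.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to a handle with no encoding layer and
# return the bytes written (as uppercase hex) plus the number of warnings.
sub raw_output {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    my @warnings;
    {
        # Collect warnings such as "Wide character in print" instead of
        # letting them reach STDERR.
        local $SIG{__WARN__} = sub { push @warnings, @_ };
        print {$fh} $string;
    }
    close $fh or die "Cannot close handle: $!";
    return (uc(join ' ', unpack '(H2)*', $buffer), scalar @warnings);
}

# U+00F9 fits in ISO-8859-1; U+0121 does not.
my ($latin1_bytes, $latin1_warnings) = raw_output("\x{F9}");
my ($wide_bytes,   $wide_warnings)   = raw_output("\x{F9}\x{121}");

print "$latin1_bytes (warnings: $latin1_warnings)\n";   # F9 (warnings: 0)
print "$wide_bytes (warnings: $wide_warnings)\n";       # C3 B9 C4 A1 (warnings: 1)
```

In the second case the whole string is written as UTF-8, including the U+00F9 that was emitted as a single byte in the first case.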
