Problem with decoding unicode JSON in perl

Question

I am seeing strange behavior in Perl when trying to decode a Unicode JSON string produced by a PHP script's json_encode function. I have simplified the problem to the following code:

#!/usr/bin/perl
use CGI;
use JSON;
print CGI::header(-type=>'text/html', -charset=>'UTF-8');

print %{ decode_json('{"test_1" : "= \u00F9 ="}') }->{'test_1'};
print '<br>';
print %{ decode_json('{"test_2" : "= \u00F9 \u0121 ="}') }->{'test_2'};

When I run this script in a browser, I see the following:

= � =
= ù ġ =

The first line contains a "broken character"; the second is correct. What I think is happening is that, for some reason, Perl decodes the first string as ISO-8859-1: if I change the page encoding to ISO-8859-1, the first line is correct, but the second is broken.

My Perl version is 5.10.1 and the JSON version is 2.51.

Question: how can I force Perl's decode_json to return UTF-8 characters in the first print?

Note: I can fix the problem by manually converting the first output to UTF-8, but this requires installing an additional "Encoder" module, which I want to avoid.

The Encode module has shipped with Perl since v5.7.3.

– daxim, Commented Apr 5, 2011 at 22:30
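
As a minimal sketch of the manual conversion mentioned in the question, the core Encode module can turn the decoded characters into UTF-8 bytes before printing. JSON::PP is used here only as an assumption, so that the example runs with core modules alone (it has been core since Perl 5.14); its decode_json behaves like the one exported by JSON. The variable names are illustrative.

```perl
use strict;
use warnings;
use JSON::PP;             # assumption: stands in for JSON so only core modules are needed
use Encode qw(encode);    # core module, no extra installation required

# decode_json returns character strings (code points), not encoded bytes.
my $text = decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};

# Explicitly convert the characters to UTF-8 bytes before output.
my $bytes = encode('UTF-8', $text);

print $bytes, "\n";
```

Because $bytes already holds encoded bytes, it should be printed to a filehandle without an encoding layer; combining it with binmode(STDOUT, ":utf8") would encode the data twice.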

Øyvind Skaar · Accepted Answer · 2011-04-04 11:15:35Z

4

I tried your code, and it generated several warnings with "use warnings;" enabled.

If you want to be sure to get UTF-8 output, I believe you have to tell Perl so. Use "binmode(STDOUT, ":utf8");" or similar.

This works on the command-line:

use strict;
use warnings;
use JSON;

binmode(STDOUT, ":utf8");

print decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};
print '<br>';
print decode_json('{"test_2" : "= \u00F9 \u0121 ="}')->{'test_2'};

EDIT: AFAIK, this does not affect decode_json() itself, only the output of the Perl script. Unicode tutorials often tell you to state explicitly which encoding you want on your input and output filehandles.
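
To illustrate that the encoding layer acts on output rather than on decode_json(), here is a small sketch that writes the decoded string through an in-memory filehandle opened with a UTF-8 layer. The in-memory handle and JSON::PP (core since Perl 5.14, used in place of JSON) are assumptions made so the result can be inspected without a terminal or browser.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: core stand-in for JSON, same decode_json interface

# The decoded value is a string of characters, independent of any output encoding.
my $text = decode_json('{"test_2" : "= \u00F9 \u0121 ="}')->{test_2};

# An in-memory filehandle with an explicit UTF-8 layer plays the role of
# STDOUT after binmode: characters go in, UTF-8 bytes come out.
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open failed: $!";
print {$out} $text;
close($out);

printf "characters: %d, bytes written: %d\n", length($text), length($buffer);
```

The string has fewer characters than the buffer has bytes because each of the two non-ASCII characters is written as a two-byte UTF-8 sequence.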

edited Apr 4, 2011 at 11:15

answered Apr 4, 2011 at 11:06


5 Comments

braz Over a year ago

However, it is strange that Perl can't decode "\u..." characters to UTF-8 by default.

Øyvind Skaar Over a year ago

No, it's not that.. read joelonsoftware.com/articles/Unicode.html , then perldoc.perl.org/perlunitut.html and then take a look at perldoc.perl.org/perlunifaq.html

Øyvind Skaar Over a year ago

from the faq: "The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead."

braz Over a year ago

I've read the first article, it is very nice. Just to make sure that I understood everything correctly, is my explanation below correct:

braz Over a year ago

In the first string of my example, Perl sees a character that can be represented in ISO-8859-1, so it outputs it that way, and because my page encoding is UTF-8 the character looks broken. When Perl reaches the second string, it sees the character \u0121, which can't be represented in ISO-8859-1, so it emits a warning and outputs the whole string as UTF-8?
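
One way to check which of the two cases a string falls into, assuming the rule quoted from the FAQ (ordinal values greater than 255 trigger the UTF-8 fallback), is sketched below. The helper has_wide_char is hypothetical and not part of JSON or Perl itself; JSON::PP is again an assumed core stand-in for JSON.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: core stand-in for JSON, same decode_json interface

# Hypothetical helper: returns 1 if the string contains a character with an
# ordinal value above 255, i.e. one that does not fit in ISO-8859-1.
sub has_wide_char {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $first  = decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};
my $second = decode_json('{"test_2" : "= \u00F9 \u0121 ="}')->{test_2};

printf "test_1 has a wide character: %d\n", has_wide_char($first);
printf "test_2 has a wide character: %d\n", has_wide_char($second);
```

Setting an explicit output layer, as in the answer, removes the dependence on this distinction, since both strings are then encoded the same way.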
