
I pass a UTF-8-encoded string from my command line into a Perl program:

> ./test.pl --string='ḷet ūs try ṭhiñgs'

which seems to recognize the string correctly:

use utf8;
use Data::Dumper;
use Getopt::Long;

my $string;
GetOptions(
    'string=s' => \$string,
) or die;
print Dumper($string);
print Dumper(utf8::is_utf8($string));
print Dumper(utf8::valid($string));

prints

$VAR1 = 'ḷet ūs try ṭhiñgs';
$VAR1 = '';
$VAR1 = 1;

When I store this string in a hash and call encode_json on it, the string seems to be encoded again, whereas to_json seems to work (if I read the output correctly):

my %a = ( 'nāme' => $string ); # Note the Unicode character                                                 
print Dumper(\%a);
print Dumper(encode_json(\%a));                                                 
print Dumper(to_json(\%a));                                                     

prints

$VAR1 = {
          "n\x{101}me" => 'ḷet ūs try ṭhiñgs'
        };
$VAR1 = '{"nāme":"ḷet Å«s try á¹­hiñgs"}';
$VAR1 = "{\"n\x{101}me\":\"\x{e1}\x{b8}\x{b7}et \x{c5}\x{ab}s try \x{e1}\x{b9}\x{ad}hi\x{c3}\x{b1}gs\"}";

Turning this back into the original hash, however, doesn't seem to work with either method, and in both cases the hash and the string are broken:

print Dumper(decode_json(encode_json(\%a)));                                    
print Dumper(from_json(to_json(\%a)));    

prints

$VAR1 = {
           "n\x{101}me" => "\x{e1}\x{b8}\x{b7}et \x{c5}\x{ab}s try \x{e1}\x{b9}\x{ad}hi\x{c3}\x{b1}gs"
        };
$VAR1 = {
          "n\x{101}me" => "\x{e1}\x{b8}\x{b7}et \x{c5}\x{ab}s try \x{e1}\x{b9}\x{ad}hi\x{c3}\x{b1}gs"
        };

A hash lookup $a{'nāme'} now fails.

Question: How do I handle utf8 encoding and strings and JSON encode/decode correctly in Perl?

  • It's obvious from your very first print Dumper(utf8::is_utf8($string)); returning '' that the string is not recognised as UTF-8. Commented Feb 19, 2016 at 0:06
  • ...but the utf8::valid($string) returns true. Commented Feb 19, 2016 at 0:26
  • @Jens That doesn't mean what you think it means. Commented Feb 19, 2016 at 0:39
  • @MattJacob: uhm... ok? :-) Commented Feb 19, 2016 at 0:49
  • @Jens In this case, "valid" means "consistent". It's marked "INTERNAL" for a reason. Just... don't use it. Commented Feb 19, 2016 at 0:51

1 Answer


You need to decode your input:

use Encode;

my $string;
GetOptions('string=s' => \$string) or die;
$string = decode('UTF-8', $string);
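To illustrate why the decode step matters, here is a minimal sketch that stands in for a raw command-line argument with a hard-coded byte string (the variable names are only for illustration). Before decoding, Perl sees two separate bytes; after decoding, it sees one character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for an undecoded command-line argument:
# the two UTF-8 bytes that make up "ū" (U+016B).
my $bytes = "\xC5\xAB";

# decode() turns the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d, code point U+%04X\n", length($chars), ord $chars;
```

This should print a length of 2 for the byte string and a length of 1 with code point U+016B for the decoded string, which is the same difference that made utf8::is_utf8 return '' in the question.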

Putting it all together, we get:

use strict;
use warnings;
use 5.012;
use utf8;

use Encode;
use Getopt::Long;
use JSON;

my $string;
GetOptions('string=s' => \$string) or die;
$string = decode('UTF-8', $string);

my %hash = ('nāme' => $string);
my $json = encode_json(\%hash);
my $href = decode_json($json);

binmode(STDOUT, ':encoding(UTF-8)');
say $href->{nāme};

Example:

$ perl test.pl --string='ḷet ūs try ṭhiñgs'
ḷet ūs try ṭhiñgs

Make sure your source file is actually encoded as UTF-8!


7 Comments

That doesn't address the hash after the from_json, does it?
@Jens Don't use to_json/from_json. Use encode_json/decode_json instead to maintain UTF-8.
So how would I handle the hash lookup at the end of my question correctly then?
@Jens If you use encode_json on a hash reference, you'll get a string that represents a JSON object. If you use decode_json on a string containing a JSON object, you'll get a hash reference. So, $href = decode_json('{...}'); say $href->{'nāme'};
Thanks! I see that you also added binmode.
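The difference between the two pairs of functions discussed in these comments can be sketched as follows. This is only an illustration, assuming the same JSON module as in the answer and an already decoded character string (the sample value 'ūs' is made up): encode_json/decode_json work with UTF-8 bytes, while to_json/from_json work with character strings, so each pair round-trips as long as the input was decoded first and the two pairs are not mixed.

```perl
use strict;
use warnings;
use utf8;      # the literals below are written in UTF-8
use JSON;      # same module as in the answer above

# Already decoded character strings, as after decode('UTF-8', ...).
my %hash = ('nāme' => 'ūs');

# encode_json returns UTF-8 bytes; decode_json expects UTF-8 bytes.
my $bytes      = encode_json(\%hash);
my $from_bytes = decode_json($bytes);

# to_json returns a character string; from_json expects one.
my $text      = to_json(\%hash);
my $from_text = from_json($text);

# The byte form is longer because each non-ASCII character takes two bytes.
print "byte form is longer\n" if length($bytes) > length($text);

# Both round trips restore the key and the value.
print "encode_json/decode_json ok\n" if $from_bytes->{'nāme'} eq 'ūs';
print "to_json/from_json ok\n"       if $from_text->{'nāme'} eq 'ūs';
```

The byte form is what you would normally write to a file or socket, which is one reason to prefer encode_json/decode_json for data that leaves the program.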