Apologies if this is a really stupid question or has already been asked elsewhere. I'm reading in some JSON, running decode_json on it, then extracting text from the result and writing that to a file.

My problem is that Unicode characters are encoded in the JSON as, e.g., \u2019, and decode_json appears to convert this to \x{2019}. When I grab this text and write it to a UTF-8-encoded file, it appears as garbage.

Sample code:

use warnings;
use strict;
use JSON qw( decode_json );
use Data::Dumper;

open IN, $file or die;
binmode IN, ":utf8";
my $data = <IN>;
my $json = decode_json( $data );
open OUT, ">$outfile" or die;
binmode OUT, ":utf8";
binmode STDOUT, ":utf8";
foreach my $textdat (@{ $json->{'results'} }) {
    print STDOUT Dumper($textdat);
    my $text = $textdat->{'text'};
    print OUT "$text\n";
}

The Dumper output shows that the \u encoding has been converted to \x encoding. What am I doing wrong?

2 Answers

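decode_json expects UTF-8 encoded bytes, but binmode IN, ":utf8" has already decoded your input into characters. Use from_json instead (imported with use JSON qw( from_json );), which accepts decoded Unicode text: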

my $json = from_json($data);

Another option would be to encode the data yourself:

use Encode;

my $encoded_data = encode('UTF-8', $data);
...
my $json = decode_json($encoded_data);

But it makes little sense to encode data just to decode it.
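
To make the difference concrete, here is a minimal sketch, assuming the JSON module is installed. The sample JSON string and the variable names are made up for illustration. It feeds the same document to decode_json as UTF-8 bytes and to from_json as decoded characters, then checks that both produce the same text.

```perl
use strict;
use warnings;
use JSON   qw( decode_json from_json );
use Encode qw( decode );

# Hypothetical sample: JSON as UTF-8 bytes, as read from a file opened with :raw.
# It contains a \u2019 escape and a literal e-acute encoded as the bytes C3 A9.
my $json_bytes = qq({"text":"It\\u2019s caf\xC3\xA9"});

# The same JSON as decoded characters, as read through an :encoding(UTF-8) layer.
my $json_chars = decode('UTF-8', $json_bytes);

my $from_bytes = decode_json($json_bytes);   # bytes in, characters out
my $from_chars = from_json($json_chars);     # characters in, characters out

# Both routes yield the same string of code points.
print $from_bytes->{text} eq $from_chars->{text} ? "same\n" : "different\n";

# The \u2019 escape has become the single character U+2019.
printf "U+%04X\n", ord(substr($from_bytes->{text}, 2, 1));
```

Either way, the resulting text still has to be encoded on output, for example by writing to a handle with an :encoding(UTF-8) layer.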

2 Comments

The first option didn't fix it, but from_json did - thank you!
@DomGlennon: Oh, I forgot to include use Encode, sorry.

decode_json expects UTF-8, but you're passing decoded text (Unicode Code Points) instead.

So, you could remove the existing character decoding.

use feature qw( say );
use open 'std', ':encoding(UTF-8)';
use JSON qw( decode_json );

my $json_utf8 = do {
   open(my $fh, '<:raw', $in_qfn)
      or die("Can't open \"$in_qfn\": $!\n");

   local $/;
   <$fh>;
 };

my $data = decode_json($json_utf8);

{
   open(my $fh, '>', $out_qfn)
      or die("Can't create \"$out_qfn\": $!\n");

   for my $result (@{ $data->{results} }) {
      say $fh $result->{text};
   }
}

Or, you could use from_json (or JSON->new->decode) instead of decode_json.

use feature qw( say );
use open 'std', ':encoding(UTF-8)';
use JSON qw( from_json );                         # <---

my $json_ucp = do {
   open(my $fh, '<', $in_qfn)                     # <---
      or die("Can't open \"$in_qfn\": $!\n");

   local $/;
   <$fh>;
 };

my $data = from_json($json_ucp);                  # <---

{
   open(my $fh, '>', $out_qfn)
      or die("Can't create \"$out_qfn\": $!\n");

   for my $result (@{ $data->{results} }) {
      say $fh $result->{text};
   }
}

The arrows point to the three minor differences between the two snippets.


I made a number of cleanups.

  • Add the missing local $/; so the whole file is read even if there are line breaks in the JSON.
  • Don't use 2-arg open.
  • Don't needlessly use global variables.
  • Use better names for variables. $data and $json were notably reversed, and $file held a file name, not a file.
  • Limit the scope of your variables, especially if they use up system resources (e.g. file handles).
  • Use :encoding(UTF-8) (the standard encoding) instead of :encoding(utf8) (an encoding only used by Perl). :utf8 is even worse as it uses the internal encoding rather than the standard one, and it can lead to corrupt scalars if provided bad input.
  • Get rid of the noisy quotes around identifiers used as hash keys.
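
As a small illustration of the :encoding(UTF-8) versus :encoding(utf8) point, the sketch below asks the core Encode module which encoding object each name resolves to. The variable names are made up for illustration; the reported names are the ones given in Encode's documentation.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# "UTF-8" (with the hyphen) resolves to the strict, standards-conforming encoding.
my $strict_name = find_encoding('UTF-8')->name;

# "utf8" (no hyphen) resolves to Perl's own lax variant.
my $lax_name = find_encoding('utf8')->name;

print "UTF-8 -> $strict_name\n";   # UTF-8 -> utf-8-strict
print "utf8  -> $lax_name\n";      # utf8  -> utf8
```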
