Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code. My Python version uses .yaml files as config.

Therefore, I basically had to write a converter between Perl and YAML. From what I found, Perl does not play well with YAML, but there are libs that allow dumping Perl hashes into JSON, and Python works with JSON almost natively, so I used that format as an intermediate: Perl -> JSON -> YAML. The first conversion is done in Perl code, and the second one in Python code (which also does some mangling on the data).
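For illustration, the Python half of that pipeline could look roughly like the sketch below. It is not the actual converter: the function names load_sites and to_yaml are made up here, the sample data is shortened from the config shown further down, and the YAML step assumes the third-party PyYAML package is installed.

```python
import json


def load_sites(json_bytes: bytes) -> dict:
    """Decode the UTF-8 JSON emitted by the Perl step into a Python dict."""
    return json.loads(json_bytes.decode("utf-8"))


def to_yaml(sites: dict) -> str:
    """Dump the dict as YAML, keeping accented characters readable."""
    # Assumption: PyYAML is available. It is imported here so that the
    # JSON part of the sketch still works without it.
    import yaml

    return yaml.safe_dump(sites, allow_unicode=True, default_flow_style=False)


# Hypothetical sample standing in for the JSON printed by the Perl script.
sample = '{"0100101001": {"mail": 1, "subject": "á é í ó ú", "ftp": 0}}'.encode("utf-8")
sites = load_sites(sample)
```

Any mangling of the data would go between load_sites and to_yaml, on the plain dict.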

Using the library mentioned by @simbabque, I can output YAML natively, which afterwards I must modify and play with. As I know next to nothing of Perl, I prefer to do so in Python.

Problem:
The source config files look something like this:

$sites = {
    "0100101001" => {
        mail => 1,
        from => '[email protected]',
        to => '[email protected]',
        subject => 'á é í ó ú',
        msg => 'á é í ó ú',
        ftp => 0,
        sftp => 0,
    },
    "22222222" => {
[...]

And many more of those.

My "parsing" code is the following:

use strict;
use warnings;

# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;

my $conf;
GetOptions('conf=s' => \$conf) or die;
our (
    $sites
);
do $conf;

# my $json = encode_json($sites);
my $yaml = Dump($sites);

binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);

Nothing out of the ordinary. I simply need the YAML (originally JSON) version of the Perl data. In fact, it mostly works. My problem is with the encoding.

The output of the above code is this:

  [...snip...]
  mail: 1
  msg: á é í ó ú
  sftp: 0
  subject: á é í ó ú
  [...snip...]

The encoding goes to hell and back. As far as I read, UTF-8 is the default, and just in case, I force it with binmode, but to no avail.

What am I missing here? Any workaround?

Note: I thought it may have been my shell, but locale outputs this:

❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Which seems ok.

Note 2: I know next to nothing of Perl, and it is not my intent to become an expert in it, so any enhancements/tips are greatly appreciated too.

Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.

  • The converter for converting Perl data structures to YAML is one of the many YAML modules on CPAN. Take your pick. No need to reinvent the wheel or even do this by hand. Commented Jan 12, 2018 at 16:37
  • Why didn't I find that module before... Also, didn't know about metacpan. Thanks for the tip. Anyhow, my problem is that I have to manipulate the data a bit, and I'd take days to do so in Perl, considering how little I know of it. In Python I did it in a few minutes. However, I like this lib, and maybe I could save a conversion. Thanks! Commented Jan 12, 2018 at 16:39
  • And in regards to "Perl does not play well with YAML", you do know that one of the three core people behind YAML is actually a big name in the Perl community? There are actually a few people from the Perl world involved with YAML. :) There is also a trove of stuff about Perl and YAML in tinita's blog on blogs.perl.org. Commented Jan 12, 2018 at 16:41
  • Looks like mob beat me. That's ok though. They need the points to get to the t-shirt, and I already have two! :) Commented Jan 12, 2018 at 17:05
  • @Borodin I got them both from the 10m questions thing. One for writing a post about a special answer on SO, and one for apparently someone writing one about me, though I never found it. I also have a mug and a load of stickers. Commented Jan 12, 2018 at 18:01

1 Answer

The sites config file is UTF-8 encoded, but Perl is not told so when it loads the file. Here are three workarounds:

  1. Put the use utf8 pragma inside the site configuration file. The pragma is lexically scoped, so the use utf8 in the main script is not sufficient to treat files included with do/require as UTF-8 encoded.

  2. If that is not feasible, decode the input before you pass it to the JSON encoder. Something like

    open CFG, "<:encoding(utf-8)", $conf or die "Cannot open $conf: $!";
    do { local $/; eval <CFG> };   # slurp the whole decoded file, then run it as Perl code
    close CFG;

     instead of

    do $conf

  3. Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (Unicode code points) and the output is UTF-8 encoded. The output of to_json is not encoded, or rather, it will have the same encoding as the input, which is what you want.

There is no need to encode the final output as UTF-8. Using any of the three workarounds will already produce UTF-8 encoded output.
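To see why the accented characters come out garbled in the first place, the double encoding can be mimicked in a few lines of Python (chosen only because it is the language of the second conversion step). This is a sketch of the mechanism, not of what Perl does internally: it assumes the undecoded bytes of the config file are treated as one character per byte, which is equivalent to reading them as Latin-1, and are then encoded to UTF-8 a second time on output.

```python
text = "á é í ó ú"

# The config file on disk: UTF-8 bytes, two bytes per accented letter.
raw = text.encode("utf-8")

# Without decoding, every byte is taken as a separate character
# (equivalent to reading the bytes as Latin-1).
misread = raw.decode("latin-1")

# An encoding layer on the output then encodes those characters again.
double_encoded = misread.encode("utf-8")

# What a UTF-8 terminal ends up displaying.
shown = double_encoded.decode("utf-8")

# Decoding the input once up front (workarounds 1 and 2) avoids the problem;
# after the fact, the damage can be undone by reversing the extra step.
repaired = shown.encode("latin-1").decode("utf-8")
```

Here shown is longer than text, because each accented letter has turned into two characters, while repaired is identical to the original string.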

6 Comments

Option 2 gives me the creeps.
In the question's comments, @simbabque mentioned the YAML library, and I think I'll go straight to YAML. With that, option 2 looks good (even considering the creeps). However, it doesn't seem to load the file correctly, as the sites variable is still uninitialized (the resulting YAML is empty). Any idea why that might be?
Option 2 is exactly as creepy as do $conf
In hindsight, I feel like most of the stuff I'm doing here is creepy.
I have a feeling that I'm going to be embarrassed here, but why do you have both do and eval? I can't see why { local $/; eval <CFG>; } wouldn't do everything you need when you're discarding the result of do.