
I have a piece of code:

with open('filename.txt','r') as textfile:
    kwList = [x.strip('\n') for x in textfile.readlines()]

Line 2 raises: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128)

The problem is that, according to the Python docs (https://docs.python.org/3/library/functions.html#open), Python 3 uses locale.getpreferredencoding(False) to determine the default encoding when no encoding is passed to open().

When I run locale.getpreferredencoding(False), I get 'UTF-8'.

Why does the UnicodeDecodeError report that the 'ascii' codec failed, when Python should be using 'utf-8' here?

  • The locale depends on the context you are running your script in. Run the locale.getpreferredencoding(False) command in the same context. Commented May 11, 2016 at 12:08
  • Is the UTF-8 preferred encoding being reported in the same run of the same code (e.g. you added a print(locale.getpreferredencoding(False)) directly above your with open(...) as textfile), or via some other means? Commented May 11, 2016 at 12:08
  • And why not simply set the encoding argument in the open() call? Commented May 11, 2016 at 12:08
  • @MartijnPieters, I can pass the encoding to the open() call, and I have; this is just out of curiosity. I face this problem on production servers. Commented May 11, 2016 at 12:10
  • @ChintanShah: your production code may run as the same user, but that doesn't mean it runs with the same locale. If you are running this on a POSIX system (Mac, Linux, etc.), the encoding is taken from the LC_CTYPE locale category, which is set by the LC_ALL environment variable if present, otherwise by LC_CTYPE, otherwise by LANG. So if your production code is run with LANG=C or LC_ALL=C, the default C locale is used, and that locale uses ASCII as its encoding. Commented May 11, 2016 at 12:20

1 Answer


The locale is taken from the context; on POSIX systems, that means the environment variables (see the POSIX locale documentation). If you want to test which encoding Python will decide on, you'll need to reproduce the exact context of your production environment, including the environment variables it runs with.
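One way to compare contexts is to log the locale-related variables from inside the program itself, in the same run that opens the file. The following is a minimal sketch; the helper name describe_locale_context is hypothetical, and the three variables listed are the POSIX ones that can determine the character encoding.

```python
import locale
import os

# POSIX variables that can determine the character encoding,
# listed from highest to lowest priority.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def describe_locale_context():
    """Return the locale variables and the encoding open() would default to."""
    report = {name: os.environ.get(name) for name in LOCALE_VARS}
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    return report


if __name__ == "__main__":
    # Print one KEY=value pair per line; unset variables show as None.
    for key, value in describe_locale_context().items():
        print(f"{key}={value!r}")
```

Running this both interactively and from the production entry point (cron job, service, etc.) should show whether the two environments differ.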

You are probably running your program as a subprocess of something that only sets (or inherits) the effective user, but does not copy the environment for that user. Either an explicit locale has been set by that parent process or, if none is set, the default C locale is used. The default encoding for that locale is ASCII; some systems will report this by the name ANSI_X3.4-1968:

$ LANG=C python -c 'import locale; print(locale.getpreferredencoding(False))'
ANSI_X3.4-1968
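The same check can be scripted from Python by starting a child interpreter with a modified environment. This is only a sketch: preferred_encoding_under is a hypothetical helper name, and the result is not assumed here because it varies by platform and Python version (Python 3.7 and later may report UTF-8 even under the C locale).

```python
import os
import subprocess
import sys


def preferred_encoding_under(overrides):
    """Report the preferred encoding of a child Python run with `overrides`."""
    env = os.environ.copy()
    # Drop inherited locale settings so that only the overrides apply.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        env.pop(name, None)
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(preferred_encoding_under({"LC_ALL": "C"}))
```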

If, for example, your production code is run from cron, then selecting a specific user does not set that user's environment variables. Set the LC_ALL environment variable explicitly at the top of your crontab:

LC_ALL=en_US.UTF-8

if your cron implementation supports setting variables this way, or set it on the command line you are going to run:

* * * * *    LC_ALL=nb_NO.UTF-8 /path/to/your/program

See "Where can I set environment variables that crontab will use?"
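As the comments on the question point out, passing the encoding to open() sidesteps the locale entirely. Below is a minimal sketch of the question's snippet rewritten that way; the function name read_keywords and the sample file contents are illustrative assumptions, and it presumes the file really is UTF-8.

```python
import os
import tempfile


def read_keywords(path, encoding="utf-8"):
    """Read one keyword per line, decoding with an explicit encoding."""
    # An explicit encoding means the locale's default is never consulted.
    with open(path, "r", encoding=encoding) as textfile:
        return [line.rstrip("\n") for line in textfile]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "keywords.txt")
        # U+010D encodes to the bytes 0xC4 0x8D in UTF-8, so this file
        # contains the same 0xc4 byte that the ASCII codec rejects.
        with open(sample, "wb") as fh:
            fh.write("ko\u010dka\nplain\n".encode("utf-8"))
        print(read_keywords(sample))
```

With the encoding fixed in code, the result no longer depends on whether the process was started with LANG=C or a UTF-8 locale.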


2 Comments

Any idea why LC_ALL=en_US.utf8 python -c 'import locale; print locale.getpreferredencoding(False)' gives ANSI_X3.4-1968, while locale -a returns (among other results) en_US.utf8?
@PiotrDobrogost: This can depend on your OS too. I also find that different Python versions are being difficult about the spelling; on Python 3.6, using UTF-8 works (so LC_ALL=en_US.UTF-8). I'm looking into this some more now, but it is not quite operating the way I expected it to on my Mac either.
