Python uses 'ascii' codec in decoding where it should use 'UTF-8'

Question

I have a piece of code:

with open('filename.txt','r') as textfile:
    kwList = [x.strip('\n') for x in textfile.readlines()]

I get a UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5595: ordinal not in range(128) on line 2

The problem is that, according to the Python docs: https://docs.python.org/3/library/functions.html#open

Python3 uses locale.getpreferredencoding(False) to get the default encoding to use when there is no encoding specified in the open method.

When I run locale.getpreferredencoding(False), I get 'UTF-8'.

Why does the UnicodeDecodeError mention the 'ascii' codec when Python should be using 'utf-8' to decode the file?

The locale depends on the context you are running your script in. Run the locale.getpreferredencoding(False) command in the same context.
– Martijn Pieters, Commented May 11, 2016 at 12:08
Is the UTF-8 preferred encoding being reported in the same run of the same code (e.g. did you add a print(locale.getpreferredencoding(False)) directly above your with open(...) as textfile, or check it some other way)?
– Sean Vieira, Commented May 11, 2016 at 12:08
And why not simply set the encoding argument to the open() call?
– Martijn Pieters, Commented May 11, 2016 at 12:08
@MartijnPieters, I can pass the encoding to the open() call and I have; this is just out of curiosity. I face this problem on production servers.
– Chintan Shah, Commented May 11, 2016 at 12:10
@ChintanShah: your production code may use the same user, but that doesn't mean that it uses the same locale. If you are running this on a POSIX system (Mac, Linux, etc.) then the encoding is taken from the LC_CTYPE locale category, which is set by the LC_ALL, LC_CTYPE, or LANG environment variable (checked in that order). So if your production code is run with LANG=C or LC_ALL=C, then the default C locale is used, which uses ASCII as the encoding.
– Martijn Pieters, Commented May 11, 2016 at 12:20

Community · Accepted Answer · 2017-05-23 12:17:23Z

2

The locale is taken from the context; on POSIX systems, that means the environment variables (see the POSIX locale documentation). You'll need to reproduce the exact context of your production environment if you want to test what encoding Python will decide on (e.g. copy the environment variables used by the production environment too).

You are probably running your program as a subprocess of something that only sets (or inherits) the effective user, but does not copy the environment for that user. Either an explicit locale has been set by that parent process or, if none is set, the default C locale is used. The default encoding for that locale is ASCII; some systems will report this by the name ANSI_X3.4-1968:

$ LANG=C python -c 'import locale; print(locale.getpreferredencoding(False))'
ANSI_X3.4-1968
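The same check can be scripted, which makes it easier to compare several environments side by side. The following is only a sketch: the helper name preferred_encoding_under is made up for illustration, and it assumes Python 3.7 or later (for capture_output). Note that Python 3.7 and later may report UTF-8 even under the C locale, because of the locale coercion and UTF-8 mode added in that release, so the result can differ from the ANSI_X3.4-1968 shown above.

```python
import os
import subprocess
import sys

# Variables that influence which encoding a fresh interpreter picks.
LOCALE_VARS = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONUTF8", "PYTHONCOERCECLOCALE")


def preferred_encoding_under(env_overrides):
    """Hypothetical helper: report locale.getpreferredencoding(False)
    as seen by a new Python process started with a modified environment."""
    # Start from the current environment, minus the locale-related variables,
    # so that only the overrides decide the locale of the child process.
    env = {k: v for k, v in os.environ.items() if k not in LOCALE_VARS}
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Compare a bare environment with explicit locale settings.
    for overrides in ({}, {"LANG": "C"}, {"LC_ALL": "en_US.UTF-8"}):
        print(overrides, "->", preferred_encoding_under(overrides))
```

Running this under the same user and launcher as the production job (for example from cron) shows which encoding the job actually gets.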

If, for example, your production code is run from cron, then the user's environment variables are not set, even when the job runs as a specific user. Set the LC_ALL environment variable explicitly at the top of your crontab:

LC_ALL=en_US.UTF-8

if your cron implementation supports setting variables this way, or set it on the command line you are going to run:

* * * * *    LC_ALL=nb_NO.UTF-8 /path/to/your/program

See Where can I set environment variables that crontab will use?
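As suggested in the comments on the question, the code can also be made independent of the locale by passing the encoding explicitly to open(). The following is a minimal sketch, assuming the file really is UTF-8; the function name read_keywords and the sample file contents are made up for illustration.

```python
import os
import tempfile


def read_keywords(path, encoding="utf-8"):
    """Read one keyword per line, decoding with an explicit encoding
    instead of whatever locale.getpreferredencoding(False) returns."""
    with open(path, "r", encoding=encoding) as textfile:
        return [line.rstrip("\n") for line in textfile]


# Build a small sample file whose first byte is 0xc4, as in the error message
# ("\u0107" encodes to the bytes 0xc4 0x87 in UTF-8).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "filename.txt")
with open(path, "wb") as f:
    f.write("\u0107\nplain\n".encode("utf-8"))

# With an explicit encoding the result no longer depends on LANG or LC_ALL.
keywords = read_keywords(path)

# Forcing ASCII reproduces the failure seen under the C locale.
ascii_error = None
try:
    read_keywords(path, encoding="ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc

os.remove(path)
os.rmdir(tmpdir)
```

This does not explain why production picks ASCII, but it removes the dependency on how the process was started.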

edited May 23, 2017 at 12:17

answered May 11, 2016 at 13:32

Martijn Pieters

2 Comments

Piotr Dobrogost Over a year ago

Any idea what might be the reason for getting ANSI_X3.4-1968 from LC_ALL=en_US.utf8 python -c 'import locale; print locale.getpreferredencoding(False)' while locale -a returns (among other results) en_US.utf8?

Martijn Pieters Over a year ago

@PiotrDobrogost: This can depend on your OS too. I also find that different Python versions are being difficult about the spelling; on Python 3.6, using UTF-8 works (so LC_ALL=en_US.UTF-8). I'm looking into this some more now, but it is not quite operating the way I expected it to on my Mac either.
