
I have to process an input text file, which may be ANSI-encoded, and convert it to UTF-8 while doing some processing of the lines read. In Python, that amounts to:

with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old, \
     open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = line  # ... do processing here ...
        new.write(modified)

However, this works as expected only if the input file really is ANSI (Windows cp1252). If the input file was originally UTF-8, the above code still runs silently, decoding the bytes as ANSI, and the output is not as expected.
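For example (an illustration, not from the original post), the two UTF-8 bytes for 'é' decode under cp1252 without raising any error; they just come out as mojibake:

# The UTF-8 encoding of 'é' is b'\xc3\xa9'; decoding it as cp1252
# succeeds silently but produces two wrong characters.
print(b'\xc3\xa9'.decode('cp1252'))   # prints 'Ã©' instead of 'é'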

So the question is: how do I handle the case where the existing file is already UTF-8, so that I either read it as UTF-8 or, better, skip the conversion entirely?

Thanks

  • Yes, all files are generated on a Windows machine.

2 Answers


So the question is: how do I handle the case where the existing file is already UTF-8, so that I either read it as UTF-8 or, better, skip the conversion entirely?

UTF-8 is more constraining than CP1252, and both are ASCII-compatible. So start by reading the file as UTF-8: if that works, you're fine (it's either plain ASCII or valid UTF-8); if it doesn't, fall back to CP1252.
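A minimal sketch of that fallback (read_lines is an illustrative helper name, not part of the question's code):

def read_lines(path):
    try:
        # UTF-8 will reject any byte sequence that isn't valid UTF-8
        with open(path, 'r', encoding='utf-8') as f:
            return f.readlines()
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the Windows ANSI code page
        with open(path, 'r', encoding='cp1252') as f:
            return f.readlines()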

Alternatively you could try running chardet on it, but that's not necessarily more reliable: every byte value is "valid" in the ISO-8859 encodings (of which CP1252 is a derivative), so every file "decodes properly" under them; it just produces garbage.
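If you want to try detection anyway, the usual chardet pattern looks like this (a sketch reusing input_file_location from the question; it assumes the chardet package is installed):

import chardet

with open(input_file_location, 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
encoding = guess['encoding'] or 'cp1252'   # detect() may return None for the encoding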


There isn't a guaranteed way to determine the encoding of a file if it isn't known in advance. However, if you are sure the possibilities are restricted to UTF-8 and cp1252, the following approach may work:

  1. Open the file in binary mode and read the first three bytes. If these bytes are b'\xef\xbb\xbf' then the encoding is extremely likely to be 'utf-8-sig', a Microsoft variant of UTF-8 (unless you have cp1252 files that legitimately begin with "ï»¿"). See the final paragraph of this section of the codecs docs.
  2. Assume UTF-8. Both UTF-8 and cp1252 will decode bytes in the ASCII range (0-127) identically. Single bytes with the high bit set are not valid UTF-8, so if the file is encoded as cp1252 and contains such bytes a UnicodeDecodeError will be raised.
  3. Catch the above UnicodeDecodeError and try again with cp1252 (a sketch combining all three steps follows this list).
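Putting the three steps together with the conversion from the question, a sketch might look like this (detect_encoding is an illustrative helper name, not an existing API):

def detect_encoding(path):
    # Step 1: check for the Microsoft-style UTF-8 BOM
    with open(path, 'rb') as f:
        if f.read(3) == b'\xef\xbb\xbf':
            return 'utf-8-sig'
    # Step 2: assume UTF-8 and validate by decoding the whole file
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        return 'utf-8'
    # Step 3: not valid UTF-8, so fall back to cp1252
    except UnicodeDecodeError:
        return 'cp1252'

encoding = detect_encoding(input_file_location)
with open(input_file_location, 'r', encoding=encoding) as old, \
     open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = line  # ... do processing here, as in the question ...
        new.write(modified)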

Comments

One has to point out that a file with b'\xef\xbb\xbf' at the start could still be ANSI which just happens to have those three characters at the start… ;)
@deceze this is true :-) Amended accordingly.
@snakecharmerb - the Unicode BOM is what you're checking for by reading the first 3 bytes. It's no longer mandatory or recommended, and most UTF-8 text won't have it any more, since it was added mostly to distinguish between big- and little-endian Unicode encodings.
@sppc42 the UTF-16 BOM is two bytes (four for UTF-32) and denotes endianness as you say. The UTF-8 "BOM" is three bytes and is an encoding marker used by Microsoft applications; it does not relate to endianness. See the doc I linked to in the answer.
