
I have to process an input text file, which may be ANSI-encoded, and convert it to UTF-8 while doing some processing of the lines read. In Python, that amounts to:

with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old, \
     open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = line  # ... do processing here ...
        new.write(modified)

However, this works as expected only if the input file really is ANSI (Windows cp1252). If the input file was originally UTF-8, the above code still runs silently, decoding the bytes as ANSI, and the output is not as expected.
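For example (an illustration, not from the original post), the two UTF-8 bytes for 'é' decode under cp1252 without raising any error; they just come out as mojibake:

# The UTF-8 encoding of 'é' is b'\xc3\xa9'; decoding it as cp1252
# succeeds silently but produces two wrong characters.
print(b'\xc3\xa9'.decode('cp1252'))   # prints 'Ã©' instead of 'é'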

So the question is: how do I handle the case where the existing file is already UTF-8, so that I either read it as UTF-8 or, better, skip the conversion entirely?

Thanks

  • Yes, all files are generated on a Windows machine.

2 Answers


So the question is: how do I handle the case where the existing file is already UTF-8, so that I either read it as UTF-8 or, better, skip the conversion entirely?

UTF-8 is more constraining than CP1252, and both are ASCII-compatible. So start by reading the file as UTF-8: if that works, you're fine (it's either plain ASCII or valid UTF-8); if it doesn't, fall back to CP1252.
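A minimal sketch of that fallback (read_lines is an illustrative helper name, not part of the question's code):

def read_lines(path):
    try:
        # UTF-8 will reject any byte sequence that isn't valid UTF-8
        with open(path, 'r', encoding='utf-8') as f:
            return f.readlines()
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the Windows ANSI code page
        with open(path, 'r', encoding='cp1252') as f:
            return f.readlines()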

Alternatively you could try running chardet on it, but that's not necessarily more reliable: every byte value is "valid" in the ISO-8859 encodings (of which CP1252 is a derivative), so every file "decodes properly" under them; it just produces garbage.
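If you want to try detection anyway, the usual chardet pattern looks like this (a sketch reusing input_file_location from the question; it assumes the chardet package is installed):

import chardet

with open(input_file_location, 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
encoding = guess['encoding'] or 'cp1252'   # detect() may return None for the encoding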


There isn't a guaranteed way to determine the encoding of a file if it isn't known in advance. However, if you are sure the possibilities are restricted to UTF-8 and cp1252, the following approach may work:

  1. Open the file in binary mode and read the first three bytes. If these bytes are b'\xef\xbb\xbf' then the encoding is extremely likely to be 'utf-8-sig', a Microsoft variant of UTF-8 (unless you have cp1252 files that legitimately begin with "ï»¿"). See the final paragraph of this section of the codecs docs.
  2. Assume UTF-8. Both UTF-8 and cp1252 will decode bytes in the ASCII range (0-127) identically. Single bytes with the high bit set are not valid UTF-8, so if the file is encoded as cp1252 and contains such bytes a UnicodeDecodeError will be raised.
  3. Catch the above UnicodeDecodeError and try again with cp1252 (a sketch combining all three steps follows this list).
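Putting the three steps together with the conversion from the question, a sketch might look like this (detect_encoding is an illustrative helper name, not an existing API):

def detect_encoding(path):
    # Step 1: check for the Microsoft-style UTF-8 BOM
    with open(path, 'rb') as f:
        if f.read(3) == b'\xef\xbb\xbf':
            return 'utf-8-sig'
    # Step 2: assume UTF-8 and validate by decoding the whole file
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        return 'utf-8'
    # Step 3: not valid UTF-8, so fall back to cp1252
    except UnicodeDecodeError:
        return 'cp1252'

encoding = detect_encoding(input_file_location)
with open(input_file_location, 'r', encoding=encoding) as old, \
     open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = line  # ... do processing here, as in the question ...
        new.write(modified)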

Comments

One has to point out that a file with b'\xef\xbb\xbf' at the start could still be ANSI which just happens to have those three characters at the start… ;)
@deceze this is true :-) Amended accordingly.
@snakecharmerb - the Unicode BOM is what you're checking for by reading the first 3 bytes. It's no longer mandatory or recommended, and most UTF-8 text won't have it any more, since it was added mostly to distinguish between big- and little-endian Unicode encodings.
@sppc42 the UTF-16 BOM is two bytes (four for UTF-32) and denotes endianness as you say. The UTF-8 "BOM" is three bytes and is an encoding marker used by Microsoft applications; it does not relate to endianness. See the doc I linked to in the answer.
